ED 227 113                                              TM 820 890

DOCUMENT RESUME

AUTHOR          Livingston, Samuel A.; Zieky, Michael J.
TITLE           Passing Scores: A Manual for Setting Standards of
                Performance on Educational and Occupational Tests.
INSTITUTION     Educational Testing Service, Princeton, N.J.
PUB DATE        82
NOTE            63p.
AVAILABLE FROM  Educational Testing Service, Box 2885, Princeton, NJ
                08541 ($7.50).
PUB TYPE        Guides - Non-Classroom Use (055)

EDRS PRICE      MF01/PC03 Plus Postage.
DESCRIPTORS     Academic Achievement; *Academic Standards; *Cutting
                Scores; *Educational Testing; *Evaluation Criteria;
                Minimum Competency Testing; Occupational Tests;
                Political Issues; Specifications; Standardized Tests;
                Test Manuals
IDENTIFIERS     Angoff Methods; Ebel Method; Nedelsky Method;
                *Standard Setting



ABSTRACT

This manual is written for the individual responsible for choosing
the passing score on an educational or occupational test. It
concentrates on practical advice to help select and apply a method for
choosing the passing score. Decisions, standards, and judgments are
defined and discussed in terms of considerations in choosing a passing
score method. Three how-to-do-it sections discuss methods based on
judgments about (1) test questions, (2) individual test-takers, and
(3) groups of test-takers. A section on choosing a standard-setting
method contains recommendations for choosing among the previously
presented methods. Social and political issues related to passing
scores are discussed. Helpful hints provide practical advice to
increase the probability that the passing scores will be accepted. A
bibliography, limited to works published since July 1981 dealing with
the problem of setting standards, is presented. Additional
calculations required by the correction for guessing are included in
the appendix. (Author/PN)



********************************************************************
* Reproductions supplied by EDRS are the best that can be made     *
* from the original document.                                      *
********************************************************************




"PERMISSION TO REPRODUCE THIS 
MATERIAL HAS BEEN GRANTED BY



TO THE EDUCATIONAL RESOURCES 
INFORMATION CENTER (ERIC)." 



U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

[X] This document has been reproduced as
    received from the person or organization
    originating it.

[ ] Minor changes have been made to improve
    reproduction quality.

• Points of view or opinions stated in this document
  do not necessarily represent official NIE
  position or policy.








Authors' Note

We thank William H. Angoff, Ronald A. Berk, Carol A. Dwyer, Robert
L. Ebel, John J. Fremer, Ronald K. Hambleton, Richard M. Jaeger,
Robert L. Linn, John A. Meskauskas, W. James Popham, and Benjamin
Shimberg for their many helpful comments on an earlier draft of this
manual.

The opinions we have expressed in this manual are our own and do not
necessarily reflect the opinions of our reviewers or the position of
Educational Testing Service.




Table of Contents

Overview
Decisions, Standards, and Judgments
    Decisions
    Standards
    Judgments
    Two Types of Wrong Decisions
Methods Based on Judgments About Test Questions
    Nedelsky's Method
    Angoff's Method
    Ebel's Method
Methods Based on Judgments About Individual Test-Takers
    The Borderline-Group Method
    The Contrasting-Groups Method
    The Up-and-Down Method
Methods Based on Judgments About a Group of Test-Takers
Choosing a Standard-Setting Method
Social and Political Issues
Helpful Hints
Conclusion
Bibliography
Appendix



Overview

This manual is written for the person who will be responsible for
choosing the passing score on an educational or occupational test. Our
purpose in writing the manual is to help you select and apply a method
for choosing the passing score. Therefore, we have tried to concentrate
on practical advice, rather than discussions of theory or descriptions
of research findings. For the reader who is interested in those topics,
we have included a brief bibliography at the end of the manual. The
manual itself is divided into seven sections:

1. Decisions, Standards, and Judgments: some key concepts and some
   things to consider in choosing a method for choosing the passing
   score;

2. Methods Based on Judgments About Test Questions: a how-to-do-it
   section;

3. Methods Based on Judgments About Individual Test-Takers: another
   how-to-do-it section;

4. Methods Based on Judgments About a Group of Test-Takers: yet
   another how-to-do-it section;

5. Choosing a Standard-Setting Method: our recommendations for
   choosing among the methods presented in the previous three
   sections;

6. Social and Political Issues: a brief discussion of some sources of
   controversy over passing scores;

7. Helpful Hints: practical advice not included in the previous
   sections.




Decisions, Standards, and Judgments

Decisions



A test score is a piece of information about a person. How can you use
that information to make a decision? One way is to consider each
person's test score along with other information about that person,
apply your own judgment, and make the decision. This case-by-case
method of decision making has some important advantages. Because you
do not have to specify your criteria for the decision in advance, you
can take account of any relevant information you may have about the
test-taker, even if you did not originally plan to use it. Case-by-case
decision making offers each test-taker the chance to be considered
individually as a whole person. However, it also has some serious
drawbacks. It is subjective, in that two different decision-makers can
arrive at different decisions on the basis of the same information. You
cannot adequately describe your criteria for the decision in the form
of a statement to the test-takers and other interested persons. In
short, case-by-case decision making offers no assurance to the
test-takers that they will be treated fairly. As a result, it can leave
you open to charges of favoritism, prejudice, or arbitrary and
capricious actions. For these reasons, you may prefer to use a decision
rule that you will apply in the same way to all test-takers. Your
decision rule will specify what information you will be using and how
you will use it in making decisions about individual test-takers.

One very simple and very common type of decision rule is to classify
the test-takers into two groups, a higher-scoring group and a
lower-scoring group. Decision rules of this type are used in many
different testing situations. Here are only a few examples:

• The higher-scoring group will go on to another unit of instruction;
  the lower-scoring group will repeat the previous unit.
• The higher-scoring group will receive a diploma or certificate; the
  lower-scoring group will not.
• The higher-scoring group will be licensed to practice a profession;
  the lower-scoring group will not.
• The lower-scoring group will receive some kind of special remedial
  instruction; the higher-scoring group will not.
• The higher-scoring group will be admitted to a training program;
  the lower-scoring group will not.
• The higher-scoring group will be given credit for a college course
  without taking the course; the lower-scoring group will not.

Even when more complicated decision rules are used, part of the rule
often involves classifying test-takers into a higher-scoring group and
a lower-scoring group. For example, a professional certifying board
might decide to grant certification only to persons who have completed
an accredited training program and have at least two years' experience
in the profession and earn at least a specified score on a
certification test.

To use a test score in these types of decision rules, you must choose
the test score that will separate the higher-scoring group from the
lower-scoring group. The purpose of this manual is to help you make
that choice by describing several methods that you can use.

In this manual, we will use the traditional terms "pass" and "fail" to
indicate the placing of a test-taker into the higher-scoring group or
the lower-scoring group. We will refer to the score that separates the
two groups as the "passing score." We realize that these terms will be
inappropriate for some testing situations. However, we believe that our
manual will be more useful if we use these concise and familiar terms.

Standards

A standard is an answer to the question, "How much is enough?" There
are standards for many kinds of things, including the purity of food
products, the effectiveness of fire extinguishers, and the cleanliness
of auto exhaust fumes. When you choose a passing score, you are setting
a standard for performance on a test.

Choosing the passing score would not be a problem if the test-takers'
scores always fell neatly into two groups, one group of nearly perfect
scores and one group of scores at or near the chance level.
Unfortunately, in the real world of testing, we rarely get such
clear-cut results. We have to face the difficult task of deciding how
much is enough.

Standards can be either absolute or relative. A relative standard
depends on comparisons between individuals; an absolute standard does
not. In testing, a relative standard depends on comparisons among the
test-takers. The question, "How good is good enough?" is answered in
terms of the test-takers' scores. For example, consider the following
statements:

1. "If your score is in the top 5 percent of the group, it is good
   enough."
2. "If your score is above the average score of the group, it is good
   enough."
3. "If your score is not more than 20 points below the average score
   of the group, it is good enough."
4. "If your score is not in the bottom 2 percent of the group, it is
   good enough."

Each of these four statements expresses a relative standard. In each
case, "good enough" is defined in terms of the scores of the
test-takers. An individual test-taker's score will be compared against
a standard that depends on the scores of the other test-takers. The
higher the other test-takers' scores are, the higher the standard will
be. In contrast, an absolute standard is one that does not depend on
the performance of the test-takers who will be measured against it. For
the person who takes a test that will be used with an absolute
standard, it does not matter how well the other test-takers do, because
their scores will not affect the standard.

To know whether a passing score represents an absolute standard or a
relative standard, you need to know whether the test scores are
expressed in absolute or relative terms. To say, "The passing score is
60 out of a possible 100," tells you little, unless you know what "60"
means. If it means "60 percent of the questions answered correctly,"
the passing score represents an absolute standard. If it means "better
than 60 percent of all the test-takers," or "two standard deviations*
below the average score of the test-takers," the passing score
represents a relative standard.

Choosing a passing score to represent a relative standard is not
difficult; you choose the score that passes the desired number or
percentage of the test-takers. For example, if the test is being used
to select students for an advanced course that is limited to thirty
students, the passing score will be the score that passes exactly
thirty students. This manual will concentrate on methods for choosing a
passing score that represents an absolute standard.
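
As a rough illustration of the relative-standard case, the short Python
sketch below picks the score earned by the N-th highest test-taker. The
function names and the score lists are hypothetical, and the sketch
assumes that a score equal to the passing score counts as a pass. It
also shows a practical wrinkle: when several test-takers are tied at
the cutoff, no single score may pass exactly the desired number.

```python
def relative_passing_score(scores, n_to_pass):
    """Return the score earned by the n_to_pass-th highest test-taker."""
    if not 1 <= n_to_pass <= len(scores):
        raise ValueError("n_to_pass must be between 1 and the number of scores")
    ranked = sorted(scores, reverse=True)
    return ranked[n_to_pass - 1]


def count_passing(scores, passing_score):
    """Count test-takers whose score is at or above the passing score."""
    return sum(1 for score in scores if score >= passing_score)


# Hypothetical scores for eight test-takers competing for three places.
scores = [55, 72, 91, 68, 80, 77, 64, 88]
cutoff = relative_passing_score(scores, 3)
print(cutoff, count_passing(scores, cutoff))

# With a tie at the cutoff, more than the intended number would pass.
tied_scores = [90, 85, 85, 70]
tied_cutoff = relative_passing_score(tied_scores, 2)
print(tied_cutoff, count_passing(tied_scores, tied_cutoff))
```

How to break such ties is a policy choice that this sketch does not
try to settle.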



*The "standard deviation" is a measure of how widely the scores of a
group of test-takers are spread out along the test score scale.



Judgments

Any standard, absolute or relative, is based on some type of judgment.
A standard is an answer to the question, "How good is good enough?" and
this question can be answered only by someone's judgment. The choice of
a passing score will involve judgments at some point in the process. It
is important that these judgments be

(1) made by persons who are qualified to make them,

(2) meaningful to the persons who are making them, and

(3) made in a way that takes into account the purpose of the test.

These three requirements are interrelated. Different methods for
choosing a passing score require different types of judgments and,
therefore, somewhat different qualifications for the judges. In
describing each method, we will describe the necessary qualifications
for the judges, and we will suggest ways to get them to keep the
purpose of the test in mind when they are making their judgments.



Two Types of Wrong Decisions

Whenever you use a test to classify the test-takers into two groups,
two kinds of wrong decisions can occur:

1. A test-taker who actually belongs in the lower group can get a score
   above the passing score.

2. A test-taker who actually belongs in the higher group can get a
   score below the passing score.

These wrong decisions occur because tests are almost never perfect
measures of the knowledge and skills they are intended to measure. A
test-taker's skills may vary from day to day and even from hour to
hour. A test-taker may guess at some of the questions, and there is no
way to distinguish a lucky guess from an answer that the test-taker
really knew. For most tests, the questions or problems do not include
every item of knowledge and every possible application of the skills
that the test is intended to measure. The questions or problems are
only a sample of all those that could have been included, and they may
give a misleading picture of the skills of some of the test-takers.




For all these reasons, on most tests it is impossible to choose a
passing score that will completely eliminate wrong decisions. You can
reduce the chance of passing a test-taker who should fail by using a
higher passing score. However, by doing so you will increase the chance
of failing a test-taker who should pass. Similarly, you can reduce the
chance of failing a test-taker who should pass by using a lower passing
score, but you will increase the chance of passing a test-taker who
should fail. Improving the test will reduce the number of wrong
decisions but will not eliminate them entirely.

If either type of wrong decision were of no consequence, you would not
need to use a test; you could simply pass everybody or fail everybody.
For example, if passing an unqualified test-taker would do no harm at
all, your best decision rule would be to pass everybody. The method you
use to choose the passing score should take both types of possible
wrong decisions into account.
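
The trade-off between the two kinds of wrong decisions can be seen in a
small Python sketch. Everything in it is an assumption made for
illustration: the records are invented, each one pretends that we
already know which group the test-taker truly belongs in (which in
practice we do not), and a score equal to the passing score is counted
as a pass.

```python
def wrong_decisions(records, passing_score):
    """Return (false_passes, false_fails) for one candidate passing score.

    Each record is (score, belongs_in_higher_group).
    """
    false_passes = sum(
        1 for score, belongs_in_higher in records
        if score >= passing_score and not belongs_in_higher
    )
    false_fails = sum(
        1 for score, belongs_in_higher in records
        if score < passing_score and belongs_in_higher
    )
    return false_passes, false_fails


# Hypothetical test-takers: (test score, truly belongs in the higher group).
records = [
    (48, False), (52, False), (58, True), (61, False),
    (63, True), (70, True), (75, True),
]

for candidate in (55, 60, 65):
    print(candidate, wrong_decisions(records, candidate))
```

In this invented data set, raising the passing score lowers the number
of test-takers passed in error and raises the number failed in error,
which is the pattern described above.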



Methods Based on Judgments About Test Questions

The three standard-setting methods we describe in this section of the
manual are based on the concept of the "borderline" test-taker. This
test-taker is the one whose knowledge and skills are on the borderline
between the upper group and the lower group. These methods are based on
the idea that, since the test-takers who belong in the upper group will
tend to earn higher scores than those who belong in the lower group,
the passing score should be the score that would be expected from a
person whose skills are on the borderline.* The judgments these methods
require are made in terms of the specific questions on the test.

These methods are relatively convenient and can be applied either
before or after the test is administered. In addition, the process of
making judgments about test questions focuses the judges' attention
closely on the content of the test. Most important, the necessary data
(judgments about test questions) can nearly always be obtained.
However, the type of judgment these methods call for is not simply an
evaluation of someone's performance that the judge can observe.
Instead, these methods call for a much more difficult type of judgment.
The judges must decide how a borderline test-taker would be likely to
respond to each of the questions on the test. Because of the
hypothetical nature of these judgments, we believe that these methods
need a "reality check." If you use one of these methods, you should
supplement it with some kind of information about the actual test
performance of real test-takers, if you possibly can. And if this
additional information clearly indicates that the results of the method
do not describe the performance of a borderline test-taker, you should
be prepared to admit that the method may not have worked well and to
choose the passing score in some other way.

*The earliest article describing one of these methods (Nedelsky, 1954)
referred to this person as the "F-D student."



Each of these methods consists of five basic steps:

1. Select the judges.
2. Define "borderline" knowledge and skills.
3. Train the judges in the use of the method you have chosen.
4. Collect judgments.
5. Combine the judgments to choose a passing score.

The first two steps are the same for all methods. The remaining steps
differ.

The first step in any of these methods is to select the judges. The
judges must be qualified to decide what level of the knowledge or
skills measured by the test is necessary. For example, if a test of
occupational knowledge is being used as a requirement for a nuclear
power plant operator's license, the judges must be qualified to decide
how much knowledge is necessary to protect the public against operator
errors that could result in a nuclear accident. If a reading test is
being used as a requirement for high school graduation, the judges must
be qualified to decide what a high school diploma should indicate about
a person's reading ability. In some cases, only a few people may be
qualified to serve as judges; in other cases, many may be qualified. If
only a few people have the necessary qualifications, and if it is
possible for all of them to participate as judges, try to include them
all. Otherwise, try to make sure that the judges who participate are
typical of all persons qualified to be judges. All important points of
view should be represented on the panel of judges.

How many judges should you select? If you have too few, the process
may be too greatly influenced by one or two individuals with unusually
high or unusually low standards. In this respect, the more judges, the
better. But the more judges you already have, the less you will gain
from adding one more judge. We have used these methods with as few as
five judges, but in these cases, the results were to be taken as a
recommendation, not as a final determination. We suggest you try to get
more if you possibly can.

Although it is possible to apply these methods without having the
judges communicate directly with each other, we strongly recommend that
you bring the judges together at a meeting. (If you have more than 20
judges, we suggest you divide them into smaller groups and work with
each group separately.) At this meeting, you can have the judges define
"borderline" knowledge and skills, and you can train the judges in
applying whichever passing score selection method you have chosen.

To define "borderline" knowledge and skills, first make sure the judges
understand what the test measures and how the test scores will be used.
Then ask the judges to describe, in their own words, a person whose
knowledge and skills would represent the borderline between acceptable
and unacceptable levels of the knowledge and skills the test measures.
The judges may find it convenient to describe the performance of
specific people they have worked with, whom they would classify as
"borderline." You can help the process along by asking appropriate
questions. For example, if the test is a reading comprehension test
that is being used to identify high school students who need further
instruction in reading, you might ask, "Should the borderline
test-taker be able to find specific information in a newspaper article?
To distinguish statements of fact from statements of opinion? Should
the borderline test-taker be able to recognize the main idea of a
paragraph, stated in different words, if the paragraph is from a
Reader's Digest article? How about a paragraph from an article in
Newsweek? How about Scientific American?"

Allow the judges plenty of time to agree on a definition of borderline
knowledge and skills. If there are strong differences of opinion that
cannot be resolved by a compromise, you may have to proceed without a
single definition that the entire panel of judges can agree on. But try
to get agreement if you possibly can. When the judges have agreed on a
definition, write it down, complete with examples, so you will have a
statement in words of the standard that the passing score is supposed
to represent.

From this point on, the methods differ. The three methods we will
describe are named for the people who first suggested them in books and
articles about educational measurement. The methods are known as
"Nedelsky's method," "Angoff's method," and "Ebel's method." Each of
the three methods requires a different type of judgment.


Nedelsky's Method

This method, suggested by Leo Nedelsky in 1954, can be used only with
multiple-choice tests, since it requires a judgment about each possible
wrong answer. The judge's task is to look at the question and identify
the wrong answers that a borderline test-taker would be able to
recognize as wrong, that is, as not the best of the answers presented.
For example, consider the following question, from a test of language
skills. The test-taker's task is to choose the word or phrase that best
completes the sentence.

    My music teacher thinks that Marian Anderson sings ______ any
    other contralto he has ever heard.
    (A) more well than      (B) better than
    (C) the best of         (D) more better over

A judge might decide that the borderline test-taker would be able to
eliminate wrong answers A and D. But the judge might decide that the
choice between wrong answer C and the correct answer B is too difficult
for the borderline test-taker. The judge would then identify answers A
and D as being so clearly wrong that the borderline test-taker would be
able to recognize them as wrong.



Collecting the Judgments

Should the judges make their judgments individually or try to reach a
consensus? The method seems to work fairly well either way, if the
number of judges is not too large. But even with a small number of
judges, it may take some time to get a consensus on each test question,
and with more judges, it will be even harder to get them to agree. Yet,
we believe that the judges can make more valid judgments if they share
information and opinions with each other. Therefore, we recommend the
following group procedure:

1. Have the judges make a set of preliminary judgments for all the
   questions, working individually and using a pencil to mark the wrong
   answers the borderline test-taker would be able to eliminate.

2. Conduct a brief discussion of each question, using the following
   format:

   a. Focus the judges' attention on the first wrong answer. Ask how
      many of them thought the borderline test-taker would be able to
      eliminate it as not the best answer, and how many did not think
      so.

   b. If the judges are not unanimous, ask one judge who marked the
      answer to explain why. Then ask one judge who did not mark that
      answer to explain why not. Do not try to reach agreement; just
      allow each point of view to be heard. The judges may or may not
      be swayed by the comments of their colleagues. Tell the judges
      they may change their judgments if they want to. Make sure the
      judges understand that their judgments are supposed to describe
      the performance of a borderline test-taker.

   c. Go on to the next wrong answer.

3. After all the questions have been discussed in this manner, ask the
   judges to review their decisions and make sure they have marked all
   the wrong answers they intended to mark and only those answers.

4. Collect the judgments.

To save time, you can use a shortcut version of this technique in which
you consider each question as a whole:

1. Ask how many judges eliminated all the wrong answers.

2. Ask how many judges eliminated the first wrong answer, how many
   eliminated the second wrong answer, and so on.

3. Ask for one of the judges to explain his or her reasoning in
   deciding which wrong answers to eliminate.

4. Ask for one of the judges who made a different decision to explain
   his or her reasoning.

5. Allow discussion as long as the discussion seems to be productive.
   Then remind the judges that they can change their judgments if they
   want to.

6. Go on to the next question.

You may find it useful to begin by discussing each wrong answer and
then switch, after a few questions, to discussing the question as a
whole.

One limitation of this procedure is that it requires all the judges to
make their judgments at the same time and place. Another limitation is
that, even with the shortcut, it is fairly slow (though not nearly as
slow as trying to get a group consensus on each question). For either
of these reasons, you may find it necessary to have the judges make
their judgments individually, without communicating with each other. If
you do, remember that making this type of judgment will probably be an
unfamiliar task for the judges. If possible, you should give them the
chance to practice the judging task on a sample of the questions and
discuss their work with each other before judging the rest of the
questions. (This is the procedure Nedelsky recommended.)

Some types of multiple-choice questions present problems in using
Nedelsky's method. One type that can cause problems is the negatively
worded question, like the following example:

    Which of the following foods is not a source of vitamin C?
    (A) milk  (B) orange juice  (C) raw cabbage  (D) baked potatoes

In deciding what wrong answers to mark, the judge must remember that
the better a source of vitamin C a food is, the worse an answer to the
question it is, and therefore the more likely the borderline test-taker
would be to recognize it as wrong.

Another type of question that can cause problems with Nedelsky's method
is the "multiple true-false" question, such as the following example:

    Which country or countries did the United States fight against
    during World War II?
      I. Germany
     II. Russia
    III. Italy
     IV. Japan
    (A) I only    (B) II only    (C) I and IV only
    (D) I, III, and IV only      (E) I, II, III, and IV

This question is really four true-false questions, and the judge should
deal with it that way. First the judge should decide which of the
numbered choices the borderline test-taker would identify as correct,
which choices the borderline test-taker would identify as incorrect,
and which choices the borderline test-taker would be unsure about. Then
the judge can figure out which of the answer choices (A, B, C, D, E)
the borderline test-taker could eliminate. In the example, suppose the
judge decides that the borderline test-taker would know that I
(Germany) and IV (Japan) are correct and that II (Russia) is wrong.
Then the borderline test-taker could eliminate answer choice A, because
it does not include Japan; choice B, because it does not include
Germany or Japan and does include Russia; and choice E, because it
includes Russia.
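
The elimination logic for a multiple true-false question can be written
out as a short Python sketch. The function name and data layout are our
own illustrative assumptions, not part of Nedelsky's procedure: an
answer choice is treated as eliminable if it leaves out a statement the
borderline test-taker knows to be correct, or includes a statement the
test-taker knows to be incorrect. Statements the test-taker is unsure
about (here, III) never rule a choice out.

```python
def eliminable_choices(choices, known_true, known_false):
    """Return the labels of answer choices a test-taker could rule out.

    choices maps each answer label to the set of statements it asserts.
    """
    eliminated = []
    for label, statements in choices.items():
        omits_known_true = not known_true <= statements
        includes_known_false = bool(known_false & statements)
        if omits_known_true or includes_known_false:
            eliminated.append(label)
    return eliminated


# The World War II example from the text.
choices = {
    "A": {"I"},
    "B": {"II"},
    "C": {"I", "IV"},
    "D": {"I", "III", "IV"},
    "E": {"I", "II", "III", "IV"},
}

# The judge's assumed picture of the borderline test-taker's knowledge.
eliminated = eliminable_choices(choices, known_true={"I", "IV"}, known_false={"II"})
print(eliminated)
```

Under these assumptions the sketch rules out A, B, and E, matching the
reasoning in the text and leaving the test-taker to guess between C
and D.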

If you decide to use Nedelsky's method with a test that contains
negatively worded questions or multiple true-false questions such as
those in the examples above, be sure to give the judges plenty of
practice at judging those kinds of questions before they begin making
their judgments individually. Make sure they can follow the logic of
the judging process. When they have finished making their judgments
individually, ask them to explain the reasons for their judgments on at
least some of those questions, to make sure their marks are what they
really intended.

Another type of question that can present difficulties in using
Nedelsky's method is the question that requires the test-taker to do
some mathematical computation. The wrong answer choices to these
questions usually are the results of common mistakes. The difficulties
arise because the type of mistake that a wrong answer choice indicates
is not always obvious. Therefore, the judges may have a hard time
deciding whether or not a borderline test-taker would have selected a
particular wrong answer. Even the best qualified judges may find it
time-consuming to figure out what kind of mistake would lead to each
wrong answer. You can avoid this problem by giving the judges a copy of
the test that shows the types of mistakes that lead to each wrong
answer choice. For example, consider the following question:




    A worker dropped a hammer off the roof of a building 36 feet high.
    How long did it take the hammer to reach the ground? (Use g = 32.)
    (A) 1.06 seconds      (B) 1.125 seconds
    (C) 1.5 seconds       (D) 2.25 seconds

A judge might know that the correct formula is $s = \frac{1}{2}gt^2$,
which leads to answer C, and yet have trouble figuring out where the
wrong answers A, B, and D came from. The judge's task would be much
easier and faster if the answers on his or her copy of the test were
marked as follows:

    (A) 1.06 seconds    ($s = gt^2$)
    (B) 1.125 seconds   ($s = gt$)
    (C) 1.5 seconds     ($s = \frac{1}{2}gt^2$)
    (D) 2.25 seconds    ($s = \frac{1}{2}gt$)
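
If you are preparing such an annotated copy yourself, it can help to
confirm that each formula really does produce the answer it is attached
to. The Python sketch below solves each of the four formulas for t with
s = 36 and g = 32; the variable names are ours, and the dictionary keys
simply reuse the answer labels from the example.

```python
import math

s = 36.0  # height of the building, in feet
g = 32.0  # acceleration used in the question

# Time t obtained by solving each formula for t.
times = {
    "A": math.sqrt(s / g),      # from s = g * t**2
    "B": s / g,                 # from s = g * t
    "C": math.sqrt(2 * s / g),  # from s = (1/2) * g * t**2 (correct formula)
    "D": 2 * s / g,             # from s = (1/2) * g * t
}

for label, t in times.items():
    print(label, round(t, 3))
```

Each computed time agrees with the printed answer choice once it is
rounded to the same number of decimal places.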

One important issue in the application of Nedelsky's method (and
Angoff's and Ebel's methods, also) is whether or not to tell the judges
the correct answers to the test questions. Giving the judges the
correct answers may make the questions seem easier than they are and,
therefore, bias the judges in the direction of a higher cutoff score.
If you do not give the judges the correct answers, they may judge some
of the correct answers to be wrong answers that a borderline test-taker
would eliminate, but this information can be valuable. If several
judges eliminate the correct answer to the same question, that question
may be defective. And if one judge eliminates many of the correct
answers, that judge may be unqualified.

However, if you do not give the judges the correct answers, the judges
may feel that they are being tested and may forget that their judgments
are supposed to indicate the responses of a borderline test-taker. In
addition, the judging process will surely take longer if the judges
have to take the extra step of figuring out the right answer to each
question. A good solution, if your situation permits it, is to have the
judges take the test before the judging session and then give them the
correct answers to use while they are actually making their judgments.



Choosing the Passing Score

Nedelsky's method is based on the idea that the borderline test-taker
responds to a multiple-choice question by first eliminating the answers
he or she recognizes as wrong and then guessing at random from the
remaining answers. If the test is to be scored without a correction for
guessing, it is relatively easy to find the score that such a test-taker
would be expected to get, by applying the following rules.

1. Under Nedelsky's method, the test-taker's expected score for any
   question is 1 divided by the number of answers the test-taker has to
   guess from.

2. To find a test-taker's expected score for the whole test, add up that
   test-taker's expected scores for all the individual questions.

For example, if the borderline test-taker has eliminated all but three
possible answers, he or she has one chance in three of choosing the
correct answer. Therefore, his or her expected score for that question
is 1 divided by 3, or .33. Table 1 shows an example of these
calculations for one judge's judgments on a ten-question test.

If the test will be scored with a correction for guessing, an additional
calculation is necessary. This calculation is explained in the Appendix.
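The two rules can be written out as a short calculation. The sketch
below is an illustration only, for a test scored without a correction
for guessing. It uses the "number of answers not eliminated" column from
Table 1; the function name is an assumption made for this example.

```python
# Number of answers NOT eliminated for each of the ten questions in Table 1
remaining = [2, 1, 2, 3, 1, 5, 4, 4, 5, 5]


def nedelsky_expected_score(remaining_counts):
    """Sum of 1/n over all questions, where n is the number of answers
    the borderline test-taker must guess among (rules 1 and 2)."""
    if any(n < 1 for n in remaining_counts):
        raise ValueError("each question needs at least one remaining answer")
    return sum(1 / n for n in remaining_counts)


total = nedelsky_expected_score(remaining)
print(f"Expected total score = {total:.2f}")
```

Repeating this for every judge gives one expected score per judge, which
is the quantity that the judges' results are then combined from.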

Table 1. Example of calculations for Nedelsky's method applied
to a test scored without correction for guessing

                              Number of
                              answers not     Expected
Question    Answers*          eliminated      score

    1       A  (B)  X   X   X      2          1/2 =  .50
    2       X   X   X   X  (E)     1          1/1 = 1.00
    3       X   X   C  (D)  X      2          1/2 =  .50
    4       A   X   C  (D)  X      3          1/3 =  .33
    5      (A)  X   X   X   X      1          1/1 = 1.00
    6       A   B  (C)  D   E      5          1/5 =  .20
    7       A   B   C   X  (E)     4          1/4 =  .25
    8      (A)  B   X   D   E      4          1/4 =  .25
    9       A  (B)  C   D   E      5          1/5 =  .20
   10       A  (B)  C   D   E      5          1/5 =  .20

                                              Sum = 4.43

Expected total score = 4.43

*A circle (shown here as parentheses) indicates the correct answer; an X
indicates an answer the borderline test-taker would eliminate.

The calculations we have just described will give you a separate result
for each individual judge. How should you combine these scores? One
way is simply to average the scores in the usual way: add them up and
divide by the number of judges. This type of average is called the mean.
The disadvantage of using the mean is that it allows one judge with a
very high or very low passing score to have a large influence on the
result. A second way to combine the scores is to take the median. To
find the median, first place the scores in order from highest to lowest.
(If two judges arrive at the same score, be sure to list it twice, once
for each judge.) If the number of judges is an odd number, the median is
simply the middle score. If the number of judges is even, the median is
halfway between the two middle scores. The disadvantage of using the
median is that it disregards a great deal of information by focusing
entirely on the middle score. A third way to combine the scores
represents a compromise between the mean and the median. It is called
the trimmed mean. To compute the trimmed mean, simply eliminate the
highest and lowest scores and average the remaining scores in the usual
way.* Depending on the number of judges, you may choose to eliminate the
highest two scores and the lowest two scores, or the highest and lowest
three scores, or more. How much "trimming" to do is up to you. If you
are going to use the trimmed mean for averaging the scores, you should
let the judges know this fact before you calculate the passing score
from their judgments. Otherwise, the judges with the highest and lowest
standards may suspect that you are discriminating against them. Table 2
shows an example of these three ways of combining the scores from the
individual judges to choose a passing score. This example was
constructed to show a case in which the three ways of combining scores
produce very different results; in most cases the differences will not
be as large as they are in Table 2.

Table 2. Example of three ways to combine scores
from individual judges

Judge 1 (highest)    92.50
Judge 2              77.25          Judge 2    77.25
Judge 3              67.00          Judge 3    67.00
Judge 4              66.67          Judge 4    66.67
Judge 5 (lowest)     65.33

            Sum =   368.75              Sum = 210.92

Mean         = 368.75 ÷ 5  = 73.75
Median       = 3rd highest = 67.00
Trimmed mean = 210.92 ÷ 3  = 70.31
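The three ways of combining the judges' scores can be reproduced with
Python's standard library. This is a minimal sketch using the five
scores in Table 2; the helper function trimmed_mean and its trim
parameter are assumptions made for this illustration, not part of any
standard procedure.

```python
from statistics import mean, median

# Expected scores from the five judges in Table 2
scores = [92.50, 77.25, 67.00, 66.67, 65.33]


def trimmed_mean(values, trim=1):
    """Drop the `trim` highest and `trim` lowest scores, then average."""
    ordered = sorted(values)
    if 2 * trim >= len(ordered):
        raise ValueError("trimming would remove every score")
    return mean(ordered[trim:len(ordered) - trim])


print(f"Mean         = {mean(scores):.2f}")
print(f"Median       = {median(scores):.2f}")
print(f"Trimmed mean = {trimmed_mean(scores):.2f}")
```

With more judges, a larger value of trim removes more scores from each
end before averaging.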



* One fairly common practice is to eliminate the highest 25 percent and
the lowest 25 percent of the scores and average the middle 50 percent.
The resulting statistic is called the "midmean."




When you have collected the judgments, computed the resulting
score for each judge, and combined the results, you will have a
consensus judgment of the score that a borderline test-taker would be
expected to get on the test. Of course, even if this judgment is
correct, not every borderline test-taker would get this exact score
every time he or she takes the test. Rather, this expected score
represents the score that is typical of a borderline test-taker's
performance. If you choose this score as the passing score, a borderline
test-taker should have a 50 percent chance of passing the test (if the
Nedelsky-type judgments actually do describe the way such a test-taker
would perform on the test). Therefore, in a fairly large group of
borderline test-takers, about half would pass the test and about half
would fail.



Angoff's Method

This method, suggested by William H. Angoff in 1971, is similar to
Nedelsky's method, but it can be used with tests that are not
multiple-choice. In Angoff's method, the passing score is computed from
the expected scores for the individual questions, as in Nedelsky's
method. However, Angoff's method does not require the judge to consider
each possible wrong answer separately. Instead, the judge considers each
question as a whole and makes a judgment of the probability that a
borderline test-taker would answer the question correctly. This task may
be difficult for some judges. If the judges are not comfortable about
making judgments in terms of probabilities, ask them to imagine a group
of 100 borderline test-takers and decide how many of them would answer
the question correctly. Obviously, the easier the question, the higher
this number will be. The probability must be between .00 and 1.00. If
the questions are multiple-choice, the probability should ordinarily be
at least as large as the chance of guessing the correct answer by blind
luck (that is, 1.00 divided by the number of choices).

Collecting the Judgments

Should the judges make their judgments individually or try to reach a
consensus? Again, we recommend a compromise procedure.

1. Have the judges make preliminary judgments for the first few
   questions only.

2. Conduct a brief discussion of each of these questions, using the
   following format:

   a. Have each judge announce his or her choice of a probability for
      the question. Write these numbers on a blackboard or a large
      sheet of paper so all the judges can see them. If the numbers are
      all similar (e.g., within 10 or 15 percentage points), go on to
      the next question.

   b. If the numbers are not all similar, ask for a judge who chose one
      of the highest numbers to explain the reasons for choosing a high
      probability. Then ask for a judge who chose one of the lowest
      numbers to explain the reasons for choosing a low probability.

   c. Tell the judges they can change their judgments if they want to.
      Make sure the judges understand that their judgments are
      supposed to describe the performance of borderline test-takers.

3. After discussing the first few questions, have the judges make
   preliminary judgments for the remaining questions.

4. Discuss the remaining questions as in step 2, and give the judges a
   chance to change their judgments if they want to.

5. Collect the judgments.

Some people have used a modification of Angoff's method in which
the judges are presented with a selection of probabilities in
multiple-choice format and asked to circle one of the choices. We do not
recommend this method, for two reasons. First, it can bias the judges'
choices, particularly if the choices at one end of the scale are very
limited. For example, suppose the judges are required to choose from the
following list of probabilities:

.10   .20   .30   .40   .50   .75

A judge who thinks that all or nearly all borderline test-takers would
answer the question correctly has no way to express that opinion.
Second, limiting the judges' choice of probabilities is contrary to the
logic of Angoff's method. If you believe that the judges can make valid
probability judgments, you have no reason to restrict their choice. If
you do not believe the judges can make valid probability judgments, you
should not be using Angoff's method. The restricted choice makes sense
only if you believe that the judges can make valid probability judgments
with this kind of prompting but not without it.

Choosing the Passing Score

Finding the expected test score for a borderline test-taker is done in
basically the same way as in Nedelsky's method. If the test is scored
without a correction for guessing, the probability of a correct answer
is the test-taker's expected score for that question. Simply add the
probabilities for the individual questions to get each judge's estimate
of the borderline test-taker's expected score for the whole test. Table
3 shows an example. (If the test is scored with a correction for
guessing, you must do the additional calculation shown in the Appendix.)
You can combine the scores you have computed for the individual judges
in the same way as for Nedelsky's method, by computing the mean, or the
median, or the trimmed mean (see pages 22-23).

Table 3. Example of calculations for Angoff's method
applied to a test scored without correction for guessing

              Probability of
Question      Correct Answer

    1              .95
    2              .80
    3              .90
    4              .60
    5              .75
    6              .40
    7              .50
    8              .25
    9              .25
   10              .40

             Sum = 5.80

Expected total score = 5.80
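The Angoff calculation is a straight sum of the judged probabilities.
The sketch below is an illustration only, using one judge's ten
probabilities from Table 3 for a test scored without a correction for
guessing; the function name is an assumption made for this example.

```python
# One judge's probability of a correct answer for each question in Table 3
probabilities = [0.95, 0.80, 0.90, 0.60, 0.75, 0.40, 0.50, 0.25, 0.25, 0.40]


def angoff_expected_score(probs):
    """Add the per-question probabilities to get the expected test score."""
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
    return sum(probs)


total = angoff_expected_score(probabilities)
print(f"Expected total score = {total:.2f}")
```

The range check reflects the requirement that every judged probability
lie between .00 and 1.00.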



Ebel's Method 



Unlike the previous two methods, Ebel's method is a two-stage
procedure. Each judge first classifies the questions into groups and
then makes a single numerical judgment for each group of questions. The
classification of questions into groups is based on two kinds of
judgments about each question: a judgment of its difficulty and a
judgment of its relevance (or importance). Ebel suggested three
difficulty levels, labeled "easy," "medium," and "hard," and four
relevance categories, labeled "essential," "important," "acceptable,"
and "questionable." The judge's first task is to classify all the
questions in the test, which will result in a classification table
similar to Table 4. (If you have statistics indicating the difficulty of
each question, you may want to make this information available to the
judges to help them make the judgments of difficulty.)

The judge's second task is to make judgments about the performance
of a borderline test-taker. The judge must make one such judgment for
each of the 12 blocks of the classification table (except for those that
are empty). That is, the judge must make one judgment for the questions
classified "essential, easy," another for the questions classified
"essential, medium," and so on, all the way down to "questionable,
hard." The judgment consists of an answer to the question, "If a
borderline test-taker had to answer a large number of questions like
these, what percentage would he or she answer correctly?" Table 4
includes examples of these judgments.



Table 4. Example of classification of questions (stage 1)
and judgment (stage 2) in Ebel's method

                                  Difficulty

Relevance      Easy                  Medium                Hard

Essential      Questions #1, 4, 7,   Questions #11, 15,    Question #21
               8, 13                 22
               Judgment:             Judgment:             Judgment:
               95% correct           85% correct           80% correct

Important      Questions #2, 6, 9    Questions #10, 14,    Questions #16, 25
                                     20
               Judgment:             Judgment:             Judgment:
               90% correct           75% correct           60% correct

Acceptable     Question #5           Questions #12, 18     Questions #19, 23
               Judgment:             Judgment:             Judgment:
               80% correct           55% correct           35% correct

Questionable   Question #3           Questions: none       Questions #17, 24
               Judgment:             No judgment           Judgment:
               50% correct           needed                20% correct



Collecting the Judgments 

The group procedure that we recommend for Nedelsky's method and
Angoff's method can be adapted for Ebel's method. However, it will be
more complicated, because the judges must make two decisions about
each test question (its difficulty and its relevance) and must then make
a judgment about the borderline test-taker's performance on each of the
12 groups of questions. If you use this procedure for Ebel's method, we
recommend applying it separately to each of the two stages of Ebel's
method. The resulting procedure would be as follows:

1. Have the judges make a preliminary classification of the test
   questions into the 12 categories, working individually.

2. Conduct a brief discussion of each question, using the following
   format:

   a. Ask how many judges classified the question as "easy," as
      "medium," and as "hard." If the judges were not unanimous, ask
      one judge who classified the question as "easy" to explain why.
      Do the same for "medium" and "hard."

   b. Ask how many judges classified the question as "essential," as
      "important," as "acceptable," and as "questionable." If the judges
      are not unanimous, ask one judge who chose each category to
      explain why.

   c. Give the judges a chance to reclassify the question if they want
      to.

3. Have the judges make a preliminary judgment, for each of the 12
   categories, of the percentage of such questions a borderline
   test-taker would answer correctly.

4. Conduct a brief discussion for each of the 12 categories, using the
   following format:

   a. Have each judge announce his or her choice of a percentage for
      that category.

   b. Ask a judge who chose one of the highest numbers to explain the
      reasons for choosing a high percentage. Then ask a judge who
      chose one of the lowest numbers to explain the reasons for
      choosing a low percentage.

   c. Tell the judges they may change their judgments if they want to.
      Make sure the judges understand that the judgments are
      supposed to describe the performance of a borderline test-taker.

5. Collect the judgments.

Choosing the Passing Score 

To find the expected test score for a borderline test-taker, use the
following procedure:

1. Multiply the judged percentage correct for the first category
   ("essential, easy") by the number of questions in that category to
   get the test-taker's expected score for the first category.

2. Repeat step 1 for each of the other 11 categories.

3. Add the expected scores for the twelve categories to get the
   expected score for the whole test.

Table 5 shows the calculations based on the classifications and
judgments in Table 4. (If the test is scored with a correction for
guessing, you must perform the additional calculation shown in the
Appendix.) You can combine the scores you have computed for the
individual judges in the same way as for Nedelsky's method or Angoff's
method, by computing the mean, or the median, or the trimmed mean (see
the discussion of combining scores under Nedelsky's method).


Table 5. Example of calculations for Ebel's method applied to a
test scored without correction for guessing

                  Percentage    Number of    Expected score
                  Correct       Questions    for category

Essential
   Easy               95            5        .95 x 5 = 4.75
   Medium             85            3        .85 x 3 = 2.55
   Hard               80            1        .80 x 1 =  .80

Important
   Easy               90            3        .90 x 3 = 2.70
   Medium             75            3        .75 x 3 = 2.25
   Hard               60            2        .60 x 2 = 1.20

Acceptable
   Easy               80            1        .80 x 1 =  .80
   Medium             55            2        .55 x 2 = 1.10
   Hard               35            2        .35 x 2 =  .70

Questionable
   Easy               50            1        .50 x 1 =  .50
   Medium              *            0                   .00
   Hard               20            2        .20 x 2 =  .40

                                                 Sum = 17.75

Expected total score = 17.75

*Information not needed; no questions classified into this category.
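The three-step Ebel calculation can be sketched as follows, for a test
scored without a correction for guessing. The figures are one judge's
percentages and question counts from Tables 4 and 5; the dictionary
layout and function name are assumptions made for this illustration.

```python
# (relevance, difficulty) -> (judged percent correct, number of questions)
# None marks a category with no questions, where no judgment is needed.
ebel_judgments = {
    ("essential", "easy"): (95, 5),
    ("essential", "medium"): (85, 3),
    ("essential", "hard"): (80, 1),
    ("important", "easy"): (90, 3),
    ("important", "medium"): (75, 3),
    ("important", "hard"): (60, 2),
    ("acceptable", "easy"): (80, 1),
    ("acceptable", "medium"): (55, 2),
    ("acceptable", "hard"): (35, 2),
    ("questionable", "easy"): (50, 1),
    ("questionable", "medium"): (None, 0),
    ("questionable", "hard"): (20, 2),
}


def ebel_expected_score(judgments):
    """Multiply each category's proportion correct by its question count
    and add the results over all non-empty categories."""
    total = 0.0
    for percent, n_questions in judgments.values():
        if n_questions == 0:
            continue  # empty category contributes nothing
        total += (percent / 100) * n_questions
    return total


total = ebel_expected_score(ebel_judgments)
print(f"Expected total score = {total:.2f}")
```

The question counts add up to 25, the length of the example test.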






Methods
Based on
Judgments About
Individual Test-Takers

The methods presented in this section are based on information about
individual test-takers. They require two types of information about each
test-taker: (1) the person's test score, and (2) a judgment of the
adequacy of the test-taker's knowledge and skills. These methods include
the "borderline-group" method, the "contrasting-groups" method, and
a variation of the contrasting-groups method called the "up-and-down"
method. The main advantage of these methods is that people in our
society are accustomed to judging other people's skills as adequate or
inadequate for some purpose, especially in educational and occupational
settings. Teachers judge the skills of their students, supervisors
judge the skills of the workers they supervise, and professionals judge
the skills of their colleagues. Therefore, making this type of judgment
is likely to be a familiar and meaningful task.

The judgments used in these methods should meet the following four
requirements:

1. The judgments must be made by persons who are qualified to make
   them;

2. The judgments must be judgments of the knowledge and skills the
   test is intended to measure;

3. The judgments must reflect the test-takers' skills at the time of
   testing;

4. The judgments must reflect the judges' true opinions.

The first requirement applies to any method of choosing a passing
score: the judgments must be made by qualified persons. With methods
based on judgments of individual test-takers, two kinds of
qualifications are necessary: (1) the judges must be able to determine
each test-taker's knowledge and skills, and (2) the judges must know
what level of knowledge and skill a person passing the test should have.
It is important that the judges have both these qualifications. If you
cannot find judges who have both, you may be able to design the
standard-setting process so as to provide the information that the
judges lack. That is, you can choose judges who are familiar with the
test-takers' knowledge and skills and make them aware of the level of
knowledge and skills that will be required. Alternatively, you can
choose judges who understand the level of knowledge and skills required
and give them the opportunity to observe the test-takers' knowledge and
skills.

If the test-takers are students, their teachers or instructors may be
able to provide informed judgments of their knowledge or skills. In this
case, it is a good idea to tell the teachers not to make any judgment of
a student whose skills they have not had the chance to observe
adequately. The same principle applies when you are asking supervisors
to judge the workers they supervise, or when you are asking test-takers
to judge their peers.

In some cases the test-takers themselves may provide the judgments
of their own knowledge and skills. For example, suppose an instructor
wants to use a math test to determine whether students' math skills are
adequate for a technical training course. The instructor could give the
test to all the students at the beginning of the course the first time
it is given. After the students have progressed far enough in the course
to need those skills, the instructor could ask the students to make a
judgment: "Do you feel that your math skills at the time you began this
course were adequate for the course?" The instructor could then use
those judgments to set a passing score on the test for the next group of
students applying for the course. Notice that in this example the
students would meet both qualifications for judges. They would be aware
of their own skills and of the level of skill required.

If the judges are not already familiar with the test-takers' knowledge
and skills, you will have to give them a chance to observe a
demonstration or an example of the product of each test-taker's
knowledge and skills. For example, if the test-takers are x-ray
technologists, the judges can observe their procedure and inspect some
of the x-ray pictures they have taken. While you may not be able to
arrange for observations of all the test-takers, you may be able to get
observations of a sample of the test-takers.

What if the test itself is the best available indication of the
test-takers' skills? In this case, the judges can base their judgments
on an observation of the test-takers' actual test performance: not the
test score, but the performance itself. For example, when an essay test
is used to test students' writing skills, the judges can read the
students' essays. For a test of foreign language speaking ability or
musical performance, the judges can listen to the actual performance, or
a portion of it (either live or recorded). The same principle applies to
any performance test that is objectively scored.

A second requirement is that the judgments must be based on the
skills and knowledge the test is intended to measure. The problem is
that judgments of individuals' skills may be affected by factors that
are irrelevant to the purpose of the test. For example, teachers who are
asked to judge their students' skills in English composition may allow
their judgments to be influenced by the students' understanding of
literature, their penmanship, their punctuality in completing
assignments, their class participation, and so on. Instructions to the
judges can help to reduce the influence of these irrelevant factors. The
judges must understand clearly which characteristics of the test-takers
they should judge and which they should disregard.

A third requirement is that the judgments must reflect the test-takers'
skills at the time of testing. If the judgments are based on the judges'
familiarity with the test-takers' knowledge and skills, the judgments
should be made as close to the time of testing as possible. If the
judgments are based on a special observation, the performance that the
judges observe should be done as close to the time of testing as
possible. (If this performance is recorded in some way, it can be
observed and judged at a later time.)

There is one exception to this requirement. If the test is intended to
predict the test-takers' skills at some future time, then the judgments
should be made at that future time. For example, if a test is intended
to predict success in a training course, the judgments would have to be
made at the end of the training course.

A fourth requirement is that the judgments must reflect the judges'
true opinions. It is important to make sure that the judges have no
personal incentive to be especially strict or especially lenient in
judging the test-takers' skills. For example, when teachers are being
asked to judge their students' skills, the teachers may suspect that
their judgments will be used to evaluate the effectiveness of their
teaching. The best precaution against this sort of misunderstanding is
to make sure the judges understand how their judgments will be used.
They should realize that, by participating in the standard-setting
exercise, they are assuring that the passing score will reflect their
own individual standards.

We strongly recommend that the judges not know the test-takers' test
scores until after the judging process is complete. Even if the
judgments are based on a performance that is part of the test itself,
they should be judgments of the performance, not of the test scores. The
danger is that a judge who knows the test-takers' scores may use the
scores of the first few test-takers to establish a standard and then
judge the rest of the test-takers by comparing their test scores with
those of the first few. If the first few test-takers are not typical,
all of the remaining judgments will be distorted. But if the judges do
not have access to the test scores, they will have to judge each
test-taker individually, and the standard-setting procedure will work
the way it is supposed to.

The Borderline-Group Method

This method is based on the idea that the passing score should be the
score that would be expected from a test-taker whose skills are "on the
borderline": not quite adequate and yet not really inadequate. In this
respect it resembles the methods based on judgments of test questions.
However, instead of asking the judges to make educated guesses about
the way a borderline test-taker would perform, this method calls for the
judges to identify actual test-takers as "borderline" in the knowledge
and skills the test measures. The judges do not have to judge all of the
test-takers or even a representative sample of them. They need only
identify the ones who, in their judgment, best fit the definition of a
borderline test-taker. You then set the passing score at the median
score (the 50th percentile) of this "borderline group." The main
advantage of this method is its simplicity. It is easy to use and easy
to explain. The main disadvantage of this method is that borderline
test-takers usually are a small percentage of all the test-takers. The
judges may have trouble identifying test-takers who are truly
"borderline."

You can apply the borderline-group method by the following
sequence of steps:

1. Select the judges.

2. Define adequate, inadequate, and "borderline" levels of the skills
   and knowledge tested.

3. Identify "borderline" test-takers.

4. Obtain the test scores of the "borderline" test-takers.

5. Set the cutoff score at the median test score of the borderline
   group. This is the score that divides the group exactly in half,
   i.e., half the members above and half below.

The reason for using the median, rather than the mean (the usual
"average"), is that the median is much less affected by a few extremely
high or extremely low scores. This feature of the median is especially
important for the borderline-group method, because a test-taker with a
very high or very low score is likely to be someone who did not really
belong in the borderline group.
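The effect of one stray score on the mean and the median can be seen in
a small example. The scores below are hypothetical, invented for this
illustration only: seven borderline test-takers with similar scores and
one test-taker who probably did not belong in the group.

```python
from statistics import mean, median

# Hypothetical test scores for a borderline group (illustration only);
# the score of 93 stands for a test-taker misclassified as borderline.
borderline_scores = [58, 61, 62, 63, 64, 65, 66, 93]

cutoff_by_median = median(borderline_scores)
cutoff_by_mean = mean(borderline_scores)

print(f"Median of borderline group = {cutoff_by_median}")
print(f"Mean of borderline group   = {cutoff_by_mean}")
```

In this made-up group the single high score pulls the mean up by three
points, while the median stays in the middle of the cluster.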

If most of the test scores of the borderline group are clustered close
together, then the method is working well. But if the scores of the
borderline group are spread widely over the range of possible scores,
then the method is not working well. What can cause the borderline-group
method to work poorly?

1. The borderline group may include many test-takers who do not
   belong in it. The judges may have identified several test-takers as
   "borderline" because their skills were difficult to judge.

2. The judges may be basing their judgments on something other than
   what the test measures.

3. The judges may differ considerably in their individual standards for
   judging the test-takers.

You may be able to avoid the first problem by reminding the judges
not to include in the borderline group any test-takers whose skills they
are not familiar with. You can minimize the second and third problems
by giving the judges appropriate instructions and by getting them to
agree with each other, before making their judgments, on a definition of
"borderline" knowledge and skills.



The Contrasting-Groups Method

This method is based on the idea that the test-takers can be divided
into two contrasting groups, a "qualified" group and an "unqualified"
group, on the basis of the judgments of their knowledge and skills.
Once you have divided the test-takers into these two groups, you can
consider all the test-takers with a particular test score and ask, "Are
the majority of them qualified or unqualified?" Most of the test-takers
with very high scores will be in the "qualified" group. As you go down
the score scale, the proportion of the test-takers who are "qualified"
will decrease. At the lowest score levels, the "unqualified" test-takers
will outnumber the "qualified" test-takers. One obvious choice for a
passing score would be the score at which there are just as many
"qualified" test-takers as "unqualified" test-takers.

In many cases it will not be practical to get judgments of all
test-takers in the population. You may have to settle for judgments of a
sample of the test-takers. How should you choose the sample? If you have
to choose the sample of test-takers before you have given the test, you
can choose them at random (for example, by lottery) from among all the
people who will be taking the test. But if you can choose them after
they have taken the test, there is a better way. You can choose the
test-takers so that their scores are spread evenly throughout the
portion of the score range where the passing score might possibly be
located. For example, on a 100-question test, you might choose 10
test-takers from each five-point score interval (31-35, 36-40, etc.).
The important principle to remember is that the sample of test-takers
you select at each score level must be representative of all the
test-takers at their score level.

You can apply the contrasting-groups method by the following
sequence of steps:

1. Select the judges.

2. Define adequate and inadequate levels of the knowledge and skills
   tested.

3. Select the sample of test-takers whose skills will be judged. (Omit
   this step if you can get judgments of all the test-takers.)

4. Obtain the test scores and the judgments of the test-takers you have
   selected. Do not let the judges know the test-takers' scores.

5. Divide the test-takers at each score level into "qualified" and
   "unqualified" groups on the basis of the judgments. Compute the
   percentage of the test-takers at each score level who are in the
   "qualified" group. (If you do not have several test-takers at each
   score level, combine score levels into larger intervals before you do
   this calculation.)

6. Use a "smoothing" method (explained below) to adjust the
   percentages you have computed.

7. Choose the passing score on the basis of the "smoothed"
   percentages.

"Smoothing" the Data

When you compute the percentage of the test-takers at each score level 
who are "qualified" (step 5 above), you may find that the percentage 
does not increase steadily from one level to the next. Instead, it may fol- 
low a 2igzag pattern. For example, in Table 6, as you go down the test 
score scale, the percent qualified drops from 100 to 75, jumps to 95^^* 
drops to 60, rises to<69, drpps steadily to 18, then jumps to 43, attffsb 
on. This kind q\ result fs especially likely if the number of test-takers at 
each scoie-teyel is small. Itseems reasonable to assume that if you could 
get jydgnrtents of all possible test-takers, the percent qualified would in- 
crease steadily from one score level to the next (possibly leveling off at 
the highest and lowest levels). What you ne£d, then, is a way to adjust , t 
the percentages to bring them closer to what you would have found if 
Vou had obtained test scores and judgments of all' possible test-takers. 
The general term for adjustments of this kind is "smoothing" Figure 1 




shows why. The solid line on the graph connects the actual observed
percentages. The broken line connects the "smoothed" percentages.
The broken line is "smoother" and, presumably, closer to the percentages
that would be observed if a much larger group of test-takers had
been judged.



Table 6. Data for examples of smoothing*

                  Number of Test-Takers
Test Score    Qualified   Unqualified   Total    Percent Qualified
  96-100          5            0           5           100
  91-95           3            1           4            75
  86-90           6            2           8            75
  81-85          18            1          19            95
  76-80          17            3          20            85
  71-75          15           10          25            60
  66-70          20            9          29            69
  61-65           7            8          15            47
  56-60           6           17          23            26
  51-55           2            9          11            18
  46-50           6            8          14            43
  41-45           2            4           6            33
  36-40           2           12          14            14
  31-35           0            7           7             0
   0-30           0            3           3             0

*From W. Kastrinos and S. A. Livingston, The Development of a Proficiency
Examination for Dental Auxiliaries (Princeton, N.J.: Educational Testing
Service, 1979), p. 64.








There are several techniques for smoothing observed percentages.
Some smoothing techniques involve complex calculations, but others
are extremely simple. All smoothing methods are based on the idea that
the judgments of test-takers at each test-score level tell you something
about the knowledge and skills of test-takers at nearby test-score levels.

One smoothing method that is easy to apply is to draw a graph like
Figure 1, showing the percentages as points, then try to draw a smooth
curve that comes as close to the points as possible. If the number of
test-takers varies from one level to the next, try to get the curve closer to
the points that represent larger numbers of test-takers. This technique is
called "graphic smoothing." It is somewhat subjective; that is, different
people applying the method could come up with slightly different
results. Nevertheless, it works well; that is, it produces results that are
very similar to the results of the more objective methods of smoothing.




Another simple smoothing method is to replace the observed percentage
at each test-score level with the average of the percentages for
that score level and the two adjacent score levels. For example, in Table
6, the "smoothed" percent-qualified for test-score level 86-90 would be
the average of the percentages for test-score levels 81-85, 86-90, and
91-95. This number would be the average of 95, 75, and 75, which is
approximately 82. We would expect that in a very large group of
test-takers with scores between 86 and 90, the percent judged to be
qualified would be closer to 82 than to 75.
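The three-level average just described is easy to compute by machine. The sketch below is an illustration only; the function name is an assumption, and the input list is the Percent Qualified column of Table 6, read from the highest score level down.

```python
def simple_moving_average(percents):
    """Average each interior value with its two neighbors (unweighted).

    The first and last levels have only one neighbor, so they are
    left as None rather than smoothed.
    """
    smoothed = [None] * len(percents)
    for i in range(1, len(percents) - 1):
        smoothed[i] = round(sum(percents[i - 1:i + 2]) / 3)
    return smoothed

# Percent qualified from Table 6, score levels 96-100 down to 0-30.
observed = [100, 75, 75, 95, 85, 60, 69, 47, 26, 18, 43, 33, 14, 0, 0]
smoothed = simple_moving_average(observed)
print(smoothed[2])  # score level 86-90: average of 75, 75, and 95
```

The value printed for score level 86-90 is 82, the figure worked out in the text.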

An improvement on this method is to weight each percentage by the
number of test-takers at each score level. This procedure has the effect
of combining the test-takers at the three score levels and computing the
percent-qualified for the merged group. Table 7 illustrates this "moving
average" method.

Table 7. Smoothing by "moving average"

               Number of Test-Takers
Test Score     Qualified     Total      "Smoothed" Percent Qualified
  96-100           5            5       -*
  91-95            3            4       (5 + 3 + 6) / (5 + 4 + 8) = 82%
  86-90            6            8       (3 + 6 + 18) / (4 + 8 + 19) = 87%
  81-85           18           19       (6 + 18 + 17) / (8 + 19 + 20) = 87%
  76-80           17           20       (18 + 17 + 15) / (19 + 20 + 25) = 78%
  71-75           15           25       (17 + 15 + 20) / (20 + 25 + 29) = 70%
  66-70           20           29       (15 + 20 + 7) / (25 + 29 + 15) = 61%
  61-65            7           15       ...and so on

*This method cannot be used to estimate the percent qualified at the lowest
and highest test-score levels.

The "moving average" cannot be computed at the very lowest and highest
test-score levels, but this limitation should not often present a serious
problem in setting cutoff scores. Notice that the results of this method
are "smoother" than the original observed percentages shown in Table 6;
that is, the percent-qualified does not change so abruptly from level to
level. However, the smoothing did not remove all the inconsistencies:
the smoothed percentage for test-score level 91-95 is still less than for
the two score levels immediately below it.

Different smoothing methods can result in different passing scores.
Although these differences will tend to be small, you may want to keep
the process as objective as possible by specifying which smoothing
method you will use before you collect the data. You may find that the
resulting curve is not as smooth as you would like, but you will be
protected against the charge that you deliberately chose a smoothing
method that would produce a particular passing score.
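The weighted "moving average" can be sketched as follows. The function name is an assumption made for illustration; the counts are the Qualified and Total columns of Table 6 for score levels 96-100 down to 61-65.

```python
def weighted_moving_average(qualified, total):
    """Pool each score level with its two neighbors, then take the percentage.

    Pooling the counts is the same as weighting each level's percentage
    by its number of test-takers. The end levels are left as None.
    """
    smoothed = [None] * len(total)
    for i in range(1, len(total) - 1):
        pooled_qualified = sum(qualified[i - 1:i + 2])
        pooled_total = sum(total[i - 1:i + 2])
        smoothed[i] = round(100 * pooled_qualified / pooled_total)
    return smoothed

# Counts from Table 6, score levels 96-100 down to 61-65.
qualified = [5, 3, 6, 18, 17, 15, 20, 7]
total = [5, 4, 8, 19, 20, 25, 29, 15]
smoothed = weighted_moving_average(qualified, total)
print(smoothed)
```

The printed list is [None, 82, 87, 87, 78, 70, 61, None]. The last entry is None only because the list stops at score level 61-65; with the full table, that level would be smoothed as well.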



Choosing the Passing Score 

The final step in applying the contrasting-groups method is the choice of
the passing score. One logical choice is the test score for which the
"smoothed" percent-qualified is exactly 50 percent. At any lower
test-score level, a test-taker is more likely to be judged unqualified than
qualified, while the reverse is true at any higher test-score level. For the
smoothed percentages indicated by the curve in Figure 1, this reasoning
would lead to a passing score of approximately 65.

The rationale for setting the passing score at the test score that
corresponds to a 50 percent chance of being judged as qualified is based on
the assumption that the two types of possible wrong decisions about a
test-taker are equally serious. But what if they are not? For example,
what if it is twice as bad to pass an unqualified test-taker as it is to fail a
qualified test-taker? In this case, the passing score should be higher, but
how much higher? Statistical decision theory (which, at its simplest
levels, is really common sense expressed in mathematical language)
provides an answer to this question. The answer is based on the idea
that your choice of a passing score should depend on the total harm
from all the wrong decisions you can expect to make.

If it is twice as serious to pass an unqualified test-taker as it is to fail a
qualified test-taker, then passing an unqualified test-taker would be
exactly as bad as failing two qualified test-takers. The best choice for the
passing score would be the test score at which there are exactly two
qualified test-takers for every unqualified test-taker. This would be the
test score that corresponds to 67 percent-qualified. By similar reasoning,
if it were three times as bad to pass an unqualified test-taker as to
fail a qualified test-taker, the passing score would be the test score at
which qualified test-takers outnumber unqualified test-takers by three to
one. That is, the passing score would be the test score that corresponds
to 75 percent-qualified. On the other hand, failing a qualified test-taker
might be the more serious of the two types of errors (for example, if you
were testing to determine whether a student will receive an expensive
remedial training program). In that case, you might want to lower the
passing score to the test-score level where unqualified test-takers
outnumber qualified test-takers by two to one or three to one.
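The reasoning above amounts to a one-line rule. Suppose a proportion p of the test-takers at some score level are qualified. Passing everyone at that level does harm in proportion to (1 - p) times the seriousness of passing an unqualified person; failing everyone does harm in proportion to p times the seriousness of failing a qualified person. The two are equal at the target percent-qualified. The sketch below is an illustration; the function and parameter names are assumptions.

```python
def target_percent_qualified(loss_pass_unqualified, loss_fail_qualified=1.0):
    """Percent-qualified at which passing and failing do equal expected harm.

    The two arguments are relative weights: how serious each kind of
    wrong decision is judged to be. Only their ratio matters.
    """
    return 100 * loss_pass_unqualified / (loss_pass_unqualified + loss_fail_qualified)

print(round(target_percent_qualified(1)))     # errors equally serious
print(round(target_percent_qualified(2)))     # passing an unqualified person is twice as bad
print(round(target_percent_qualified(3)))     # three times as bad
print(round(target_percent_qualified(1, 2)))  # failing a qualified person is twice as bad
```

The four calls print 50, 67, 75, and 33, matching the two-to-one and three-to-one examples in the text.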

In practice, you may find it simpler to ask yourself (and any other
persons who are responsible for choosing the passing score) such questions
as:

    "Suppose you had a group of 100 people and you knew that 50
    were qualified and 50 were unqualified. If you had to pass all 100
    or fail all 100, which would you do?"

If your answer would be "Fail them," then ask the same question for a
group of 70 qualified persons and 30 unqualified persons. If your
answer would now be "Pass them," ask the same question for a group of
60 qualified persons and 40 unqualified persons. Keep adjusting the
percent qualified in this way until you have found the value at which
you cannot decide whether to pass the group or fail the group. The test
score that corresponds to this percent-qualified will be the score at
which you cannot decide whether a test-taker should pass or fail; that
is, the passing score.

How Many Test-Takers?

One question that test users often ask about the contrasting-groups
method is, "How many test-takers do I need?" The only honest answer
to this question is, "It depends." Deciding how many test-takers to
include in a contrasting-groups study generally involves a tradeoff
between costs and benefits. The costs are those of getting the judgments.
Judging more test-takers will require more time from the judges, or it
may require you to select and train more judges. It may also require
time from more of the test-takers. The benefits of a larger sample are
better representation of the test-taker population and greater precision
in determining the passing score. The degree of precision you can get
with a given number of test-takers depends on several factors:

- the extent to which the test scores and the judgments both reflect the
  same abilities of the test-takers;
- the extent to which the test scores and the judgments are free of other
  influences;
- the consistency of the test-takers' performance;
- if different judges judge different test-takers, the extent to which the
  judges have the same standards;
- the consistency with which the judges apply their standards in judging
  the test-takers.

The degree of precision you need will depend on the number of people
who will be affected by the choice of the passing score and on the
consequences of passing or failing the test. It will also depend on how
fine a distinction you are trying to make. A choice between passing
scores of 3 and 4 on a five-point test is much easier to make than a
choice between passing scores of 73 and 74 on a 100-point test.

One of us has used the contrasting-groups method with as few as 20
test-takers, but the circumstances of that study were somewhat unusual.
Only seven test-score levels were being considered as possibilities. Each
test-taker was judged by eight judges, and the judgments were based on
a sample of performance from the test itself. (It was a test of
English-speaking proficiency for persons whose native language was not
English.)* Most circumstances would call for judgments of a considerably
larger number of test-takers.

The costs of getting judgments of individual test-takers, the precision
that a given number of test-takers will provide, and the need for
precision in setting the passing score will all vary from one testing situation to
another. Therefore, we cannot prescribe a minimum number of
test-takers that will apply to all testing situations. We can only suggest that
you (1) include as many test-takers as you can afford to, and (2) consult
a statistician for advice that will apply to your testing situation.



*For a description of this study, see Samuel A. Livingston, "Setting Standards of
Speaking Proficiency," pp. 255-270 in Direct Testing of Speaking Proficiency: Theory and
Application, J. L. D. Clark, editor (Princeton, N.J.: Educational Testing Service, 1978).




The Up-and-Down Method 



One problem that often makes it difficult to use the contrasting-groups
method is the effort and expense involved in getting judgments of
individual test-takers' skills. In many cases, the effort and expense depend
directly on the number of individual test-takers to be judged. The more
judgments, the greater the cost. Therefore, you will want to concentrate
these valuable judgments in the part of the test-score range where you
most need them: the part where about half the test-takers are qualified
and half are not. But until you have collected the judgments, you will
not know where this part of the score range is. Is there any way out of
this dilemma? In some situations, the answer is "yes." If the test-takers
take the test before the judgments of their skills are made, and if you can
select the test-takers for judgment one at a time, you can use a variation
of the contrasting-groups method called the "up-and-down method."
The up-and-down method should work especially well where every
test-taker's performance has been recorded and is available for judging,
as in the case of a writing sample or an essay test. Here is how it works:

1. Select a test-taker with a test score near where you think the proper
passing score might be. Get a judgment of this test-taker's skills.

2. If the first test-taker was judged to be qualified, choose next a
test-taker with a somewhat lower test score. If the first test-taker was
judged to be unqualified, choose next a test-taker with a somewhat
higher test score. Get a judgment of the second test-taker's skills.

3. Repeat step 2, choosing the third test-taker on the basis of the
judgment of the second test-taker. Continue by choosing each test-taker
on the basis of the judgment of the previous test-taker.

Figure 2 illustrates an application of the up-and-down method. The
letters Q and U in the figure represent judgments of the test-takers as
being qualified or unqualified. Notice the way in which the method
automatically tends to move down from test-score levels where all the
test-takers are qualified and up from test-score levels where all the
test-takers are unqualified. The scores of the test-takers selected will tend to
concentrate in the range where a test-taker is about as likely to be
qualified as to be unqualified, which is where the passing score should be.

[Figure 2. Example of the up-and-down method (hypothetical data). The
figure plots test score (10 to 18) against test-taker (1 to 16), with each
judgment marked Q or U. Annotation: "To find the passing score, average
the test scores of test-takers 4 through 16."]

To choose the passing score on the basis of data collected by the
up-and-down method, you can simply take the average test score of the
persons selected for judging, beginning just before the scores start to
zigzag and ending with the score of the next person who would have been
judged if the procedure had continued. That is, disregard the first run of
qualified persons or of unqualified persons, except for the last person in
that run. For example, in Figure 2, the first four test-takers were all
judged to be qualified, so we would start with the fourth test-taker. The
16th test-taker was not actually judged, but we know that person's test
score, so we include it in the average. The passing score would be the
average score of test-takers 4 through 16, which is 12.8. Of course, in
most situations you would want to get judgments of more than 15
test-takers.
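The averaging rule can be written out as a short function. This is a sketch under stated assumptions: the function name and the list layout are invented for illustration, and the scores and judgments below are made-up data that follow the up-and-down rule with steps of one point. They are not the values plotted in Figure 2.

```python
def up_and_down_passing_score(scores, judgments, next_score):
    """Average the scores from the last person of the first run onward.

    scores, judgments: the judged test-takers, in the order selected
        ("Q" = qualified, "U" = unqualified).
    next_score: score of the person who would have been selected next.
    """
    # Find the last person in the opening run of identical judgments.
    start = 0
    while start + 1 < len(judgments) and judgments[start + 1] == judgments[0]:
        start += 1
    used = scores[start:] + [next_score]
    return sum(used) / len(used)

# Hypothetical sequence: move down one point after Q, up one point after U.
scores    = [16, 15, 14, 13, 12, 13, 12, 13, 14, 13, 12, 13, 12, 11, 12]
judgments = ["Q", "Q", "Q", "Q", "U", "Q", "U", "U", "Q", "Q",
             "U", "Q", "Q", "U", "U"]
passing = up_and_down_passing_score(scores, judgments, next_score=13)
print(round(passing, 1))
```

For this invented sequence the first run is the four Q judgments, so the average covers the fourth test-taker through the would-be sixteenth and comes to about 12.5.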

A variation on the up-and-down method is to select more than one
test-taker at a time. For example, you might select three test-takers at a
time, all with test scores at the same level. If at least two of them are
judged to be qualified, you would move down to a lower test-score level
for the next three; otherwise you would move up to a higher level.

You can use this variation of the up-and-down method to find the test
score for which the percent-qualified is something other than 50
percent. For example, suppose you want to find the score level at which
two-thirds of the test-takers are qualified. You could select five
test-takers at a time. If four or five (that is, more than two-thirds of the five)
are judged qualified, you would move down to a lower test-score level
for the next group of five; otherwise you would move up. A word of
caution: If you are looking for some percentage other than 50 percent,
you should not set the passing score by averaging the test scores of the
persons you select. Instead, you should treat the data as you would in
the regular contrasting-groups method: (1) compute the percent-qualified
at each score level, (2) smooth the percentages if necessary, and
(3) find the test-score level that corresponds to the percent-qualified you
have chosen.

If you are using the up-and-down method to choose a passing score,
it is important not to stop until you have observed several "reversals." A
reversal is a change in direction, from up to down or vice versa. For
example, in Figure 2, the reversals come after test-takers 5, [illegible], 8, 9, 10,
11, and 13. The importance of these reversals is that they will tend to
come frequently in the range where the passing score should be. In
other parts of the test-score range, there will be fewer reversals. The
more reversals you have observed, the more likely it is that you have
found the right portion of the test-score range.

How large should the steps be? That is, how far down the test-score
scale should you move after a success, and how far up after a failure?
The larger the steps, the more quickly you can find the part of the
test-score range where the passing score should be. On the other hand,
smaller steps will give you a more precise estimate once you reach that
range. Therefore, we suggest the following procedure. Use large steps
until you have observed at least five reversals. Then take one last large
step, and switch to smaller steps. A large step might be one-eighth or
one-tenth of the range of actual scores on the test (that is, of the
difference between the highest and lowest of the test-takers' scores). A small
step might be about half that size. For example, if the test-takers' scores
range from 20 to 80, you might start with steps of 6 test-score points
and then shift to steps of 3 test-score points, as in Figure 3.
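The rule of thumb for step sizes can be sketched as a small helper. The function name, the `divisor` parameter, and the rounding to whole score points are assumptions made for illustration.

```python
def step_sizes(lowest_score, highest_score, divisor=10):
    """Suggest a large and a small step for the up-and-down method.

    divisor: 8 or 10, following the "one-eighth or one-tenth of the
    range" guideline. Steps are rounded to whole score points, and
    the small step is never allowed to fall below one point.
    """
    score_range = highest_score - lowest_score
    large = max(1, round(score_range / divisor))
    small = max(1, round(large / 2))
    return large, small

large, small = step_sizes(20, 80)
print(large, small)
```

For scores ranging from 20 to 80 this gives a large step of 6 points and a small step of 3 points, as in the example in the text.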

One possible problem with the up-and-down method is that if the
judges know you are using it, each judgment may be affected by the
previous one. That is, if a judge knows that the test-taker now being
judged had a higher test score than the previous one, the judge may be
more inclined to judge the test-taker as qualified. We suggest that you
not tell the judges what rule you are using to select the test-takers until
the judging is finished. The judges may figure out the principle by
themselves, but unless you tell them, they will not be sure you are following it
consistently. Therefore, they will be more likely to continue to judge
each test-taker as an individual.



[Figure 3. Example of the up-and-down method with a change in the step size
(hypothetical data). The figure plots test score (20 to 62, in steps of 6)
against test-taker, with the reversals labeled (1st reversal, 2nd, and so on).]



Methods
Based on
Judgments About
a Group of
Test-Takers



The methods described in this section are based on judgments about a
group of test-takers, preferably a large group. This group is often called
the reference group. The simplest of these methods, and the one with
the most obvious justification, is to choose the passing score that would
have passed a specified number (or a specified percentage) of the
test-takers in the reference group. For example, if you have reason to
believe that 85 percent of last year's test-takers were qualified, you can
find the score that would have passed 85 percent of last year's
test-takers and use that score as a passing score for this year's test-takers. If
the test changes from year to year, you will have to find the score on this
year's test that would have passed 85 percent of last year's test-takers,
by using a statistical technique called "equating."* The judgment of the
percentage of the test-takers in the reference group who were qualified
leads directly to the choice of a passing score. This judgment should be
based on some type of information other than the test scores.
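Finding the score that would have passed a given percentage of the reference group requires only the number of test-takers at each score. The sketch below is an illustration with invented counts. The function name is an assumption, and so is the tie-breaking rule: because scores are whole numbers, an exact 85 percent may not be attainable, and this version takes the highest score that passes at least the judged percentage.

```python
def passing_score_for_percent(score_counts, percent_to_pass):
    """Highest score at which at least percent_to_pass percent would pass.

    score_counts: {test_score: number of reference-group test-takers}.
    A test-taker passes by scoring at or above the passing score.
    """
    total = sum(score_counts.values())
    at_or_above = 0
    for score in sorted(score_counts, reverse=True):
        at_or_above += score_counts[score]
        if 100 * at_or_above / total >= percent_to_pass:
            return score
    return min(score_counts)  # reached only if percent_to_pass exceeds 100

# Hypothetical score distribution for last year's 100 test-takers.
last_year = {10: 5, 9: 20, 8: 30, 7: 25, 6: 10, 5: 6, 4: 4}
print(passing_score_for_percent(last_year, 85))
```

With these counts, a passing score of 7 would have passed 80 percent and a passing score of 6 would have passed 90 percent, so the function returns 6.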

*For information on equating, see the chapter by W. H. Angoff cited in the
bibliography of this manual.

Does a passing score chosen by this method represent an absolute
standard or a relative standard? The answer to this question depends on
the reference group. If the reference group is the group of test-takers the
passing score will be applied to, then the standard is a relative standard;
in this case a test-taker's relative standing in the group determines
whether or not he or she passes the test. But if the reference group is a
previous group of test-takers, it has the effect of setting an absolute
standard. From the test-taker's point of view, the passing score has already
been determined. Any test-taker who scores higher than that score will
pass the test, even if the other test-takers all score higher still. And any
test-taker who scores lower than the passing score will fail, no matter
how poorly all the others do.*
You can apply this method by the following sequence of steps:

1. Identify the reference group.

2. Select the judges.

3. Define adequate and inadequate levels of the knowledge and skills
tested.

4. Collect judgments of the percentage of the people in the reference
group who have an adequate level of the knowledge and skills
tested.

5. Choose the passing score.

Steps 1 and 2 are interdependent: your choice of a reference group
will depend on your being able to find judges who can make a valid
judgment about that group.

The reference group should be fairly large, so that the judgments of
the percentage of the test-takers who are qualified will not depend
heavily on one or two of the test-takers. You do not need to know the
test scores of individual test-takers, but you do need to know how many
test-takers in the group received each test score.

The judges must be able to judge how many (or what percentage) of
the test-takers in the reference group are qualified in the knowledge and
skills the test measures. Therefore, they must know what the test
measures and what level of these skills is necessary. They must also be
familiar with the abilities of the reference group, as a group. They do not
have to identify specific individuals as qualified or not qualified, but they
must be able to judge approximately how many are qualified.

Defining adequate and inadequate levels of the knowledge and skills
tested can be done in the same way as for the methods we have
discussed previously. This is an important step in the process, in this
method as in any other method, because this definition will determine
the meaning of the standard.

The judges can make their judgments individually or as a group.
Again, we recommend a compromise procedure:

1. Have each judge make a preliminary judgment.

2. Write the judgments on a blackboard or a large sheet of paper.

3. Ask for a judge who chose a high number to explain why. Then ask
for a judge who chose a low number to explain why. Allow some
discussion, but do not try to get all the judges to agree.

4. Give the judges a chance to change their judgments if they want to.
Then collect the revised judgments.

*In 1981 the National Board of Medical Examiners changed from a standard based on
current test-takers to a standard based on previous test-takers, for exactly this reason.
(The National Board Examiner, v. 28, no. 1, Winter 1981. Philadelphia: National Board
of Medical Examiners.)

You can combine the judgments by computing the mean, the median, 
or the trimmed mean, as described earlier on pages 22-23. 

The main limitation of this method is that the judges must be able to
judge the number or the percentage of the test-takers in the reference
group who are qualified in the knowledge and skills the test measures.
This kind of judgment is not easy to make with any reasonable degree of
precision. However, if you can get an approximate judgment of this
type, you can use this method as a reality check on the methods based
on judgments about test questions. For example, if you can be fairly
sure that at least 75 percent of last year's test-takers were qualified, you
should be skeptical of any method that produces a passing score that
would have passed less than half of last year's test-takers.

One example of setting passing scores by using judgments about
groups of test-takers is the awarding of college course credit, on the
basis of an examination, to students who have not taken the course.
Typically, the college will have the students in the course take the
accreditation test at or near the end of the course. When the students'
grades have been determined, the testing office computes the
distribution of test scores for the A students, for the B students, and so on. The
college can then set the passing score on the basis of these distributions.
One popular choice is the "mean C," the average test score of the C
students. This choice means that if a student who has not taken the
course can score as high on the test as the average C student did after
taking the course, that student will get credit for the course.

Another method based on judgments of groups of test-takers is
similar to the contrasting-groups method described earlier, except that it
does not require judgments of individual test-takers. Instead, you
identify a group of persons who can be presumed to have the qualifications
the test is intended to measure and a group of persons who can be
presumed to lack these qualifications (for example, students who have had
the relevant instruction and students who have not*). You then select a
sample of persons from each group (the same number of persons from
each) and give them the test. You set the passing score at the test-score
level that best discriminates between the two samples. This method will
not necessarily produce the same result as the contrasting-groups
method based on judgments of individual test-takers. Therefore, it will
not necessarily minimize the number of wrong decisions in the group of
test-takers the test is intended for. It will do so only if (1) the test scores
of the "qualified" group are representative of the scores of the qualified
people who will be taking the test, and (2) the test scores of the
"unqualified" group are representative of the scores of the unqualified people
who will be taking the test, and (3) the proportions of "qualified" and
"unqualified" people are the same in the standard-setting study as in the
group of people the test is intended for.

*See the article by R. A. Berk listed in the bibliography.
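One way to read "best discriminates" is to pick the passing score that misclassifies the fewest people in the two samples. That reading is an assumption made for this sketch, as are the function name and the invented scores; other definitions of "best" are possible.

```python
def best_discriminating_score(qualified_scores, unqualified_scores):
    """Return (passing_score, errors) with the fewest misclassifications.

    An error is a presumed-qualified person scoring below the passing
    score or a presumed-unqualified person scoring at or above it.
    If several scores tie, the lowest one is returned.
    """
    candidates = sorted(set(qualified_scores) | set(unqualified_scores))
    best_score, fewest_errors = None, None
    for cut in candidates:
        errors = (sum(s < cut for s in qualified_scores)
                  + sum(s >= cut for s in unqualified_scores))
        if fewest_errors is None or errors < fewest_errors:
            best_score, fewest_errors = cut, errors
    return best_score, fewest_errors

# Hypothetical samples of equal size from the two groups.
instructed = [65, 70, 75, 80, 85, 90]
uninstructed = [40, 50, 55, 60, 62, 72]
print(best_discriminating_score(instructed, uninstructed))
```

For these invented samples the result is a passing score of 65 with one misclassification (the uninstructed person who scored 72). As the text cautions, a score chosen this way minimizes wrong decisions in the intended population only if the three conditions listed above hold.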






Choosing a 
Standard-Setting 
Method 

Which Method is Best?

There is no one method that is best for all testing situations. Your choice
of a method should depend on what kind of judgments you can get,
and believe. We believe that the best kind of data to use, if you can get
them, are the test scores of real test-takers whose performance has
been meaningfully judged by qualified judges. If you can have the
judges actually observe the test-takers' performance or samples of their
work, we recommend the contrasting-groups method. This situation
will occur fairly often with essay tests, hands-on performance tests, etc.
For multiple-choice tests, we recommend using the contrasting-groups
method whenever you can be reasonably sure that the judges will base
their judgments on the same qualities of the test-takers (the same
knowledge and skills) that the test measures. The contrasting-groups
method has the strongest theoretical rationale of any of the methods we
have presented, that of statistical decision theory. It is the only
standard-setting method that enables you to estimate the frequencies of the two
types of decision errors. The main disadvantage of the contrasting-groups
method is the difficulty of getting the necessary judgments.

If you cannot get valid judgments of an appropriate sample of the
test-takers,* but each judge can confidently identify individual
test-takers as good examples of people with "borderline" qualifications, we
recommend the borderline-group method. If the judges can best
express their standards in terms of the performance of a particular group
of test-takers (for example, "at least as good as the average C student"),
we recommend setting the standard in those terms.

If none of these conditions can be met, we suggest you use one of the
methods based on judgments about test questions (Nedelsky's,
Angoff's, or Ebel's), but we also suggest you compare the results of that
method with real test-score data. Be prepared to compromise if this
comparison suggests that the judges' standards were unrealistic.

*See pages 35-36 of this manual.

Methods such as Nedelsky's, Angoff's, and Ebel's are especially
useful when it is important that the passing score represent the standard of
a large and diverse group of people. For example, in choosing the
passing score on a math test used as a requirement for high school
graduation, it may be important to include the opinions of parents,
employers, and community leaders. These people are not in a position to
observe the mathematical skills of high school students, so they cannot
serve as judges in the borderline-group or contrasting-groups method.
But they can serve as judges in Nedelsky's, Angoff's, or Ebel's method.*
Nedelsky's, Angoff's, and Ebel's methods require the judges to
review the test. If security considerations prevent you from showing the
test even to the judges, you may be able to wait and hold the judging
session after the test has been given. If you do not have this option, you
may be able to collect the judgments and set the standard on another
form of the test (containing different questions measuring the same
abilities) if the form to be judged will be statistically equated to the form you
will be using. If none of these options is open to you, you will not be
able to use one of these methods.

In choosing between Nedelsky's, Angoff's, and Ebel's methods, your
main concern should be the type of judgments the judges can make
most meaningfully. Angoff's method requires the judges either to think
in terms of probabilities (which is difficult for many people) or to imagine
a group of borderline test-takers (which may be far removed from the
judges' experience). However, Angoff's method is the easiest of the
three methods to explain and the fastest to use. Ebel's method enables
the judges to take account of the difficulty and the importance of each
test question. This feature is especially valuable when the questions on
the test differ widely in their importance. Its disadvantages are its
slowness and its unsuitability for short tests. Nedelsky's method takes
account of the fact that the difficulty of a multiple-choice question depends
on just how wrong the wrong answers are. However, Nedelsky's
method can be difficult to use when the questions are negatively worded or
contain other types of complexities.



*An article by R. M. Jaeger, listed in the bibliography, presents another method of the
same general type, developed specifically for tests used as a requirement for high school
graduation.






Social and
Political Issues



Choosing the passing score on a test often leads to controversy. The
controversy may focus on your choice of a method or your selection of
judges, or it may focus on any of a number of other issues. You should
think about these issues before you begin the process of choosing a
passing score. Even if you decide not to take positions on some of these
issues, you will be better able to avoid destructive controversies (or to
resolve them if they occur) if you have thought about the issues
beforehand.



Should You Allow Exceptions to
Your Decision Rule?

A common criticism of the use of a passing score is that it fails to allow
for exceptions. There may be good reasons for making exceptions to a
rule. If you decide not to allow any exceptions, you may be forced to
make a decision that is unreasonable under the circumstances. For
example, a test-taker may have a particular handicap that results in a
lower score than other test-takers with the same level of knowledge
would get. If you could anticipate all the possible reasons that would
justify an exception, you could write them into the decision rule.
Unfortunately, no human being can foresee all the possible circumstances in
which a decision rule would be unreasonable.

The problem with allowing exceptions is that once you have made an
exception, where do you stop? You may find yourself pressured by
people seeking exceptions for reasons you do not consider legitimate.
Also, exceptions tend to undermine people's faith in the fairness of your
decision procedure. An exception that some people regard as compassion
may look to others like favoritism.

One way to deal with this dilemma is to have an established procedure
for determining whether an exception should be allowed. You
might form a standing committee to approve or deny requests for
exceptions. If you find a particular type of special circumstance occurring
frequently, you can modify your decision rule to cover it. Each time you
modify the decision rule in this way, you will reduce the number of
exceptions you will have to deal with in the future.




Should You Allow Test-Takers Who Fail the Test to
Take it Again?

In most cases the answer to this question will be "yes." A test-taker may
have a "bad day" on the day of the test. If so, the test-taker's score will
not represent his or her true level of ability. But if you do allow retakes,
should you limit the number of times a person can take the test in an
attempt to pass? Should you require persons who fail the test to wait a
specified length of time before retaking it? Should you require them to
take some sort of instruction before retaking the test? Should you retest
them with a different form of the test (that is, one with different
questions or problems constructed to measure the same general types of
knowledge and skills)?

In most cases, a person who retakes a test should be given a different
form of the test each time. Otherwise, the person may become a
specialist in the specific problems and questions on the test, without
learning the more general knowledge and skills those questions are intended
to represent. As long as different forms of the test are available, we
prefer not to limit the number of times a person can take the test. No matter
how many times the person has failed the test, it is always possible that
the person's skills may improve.

Whether to require a waiting period for persons who want to retake
the test will depend on your particular testing program. If the testing
procedure is expensive and the test-takers are not the ones paying for it,
you may want to require a waiting period as an incentive for the
test-takers to improve their skills before retaking the test. Another way to
make sure the test-takers are adequately prepared is to require failing
test-takers to have additional instruction in the knowledge and skills to
be tested, before retaking the test.



Should Persons Who Have Passed the Test
Ever Have to Take it Again?

There are situations in which such a requirement makes a great deal of
sense, particularly where an unqualified person represents a danger to
others. For example, airline pilots are required to demonstrate their
skills not just once, but every six months as long as they continue to fly.
In deciding whether this type of requirement makes sense in your
testing situation, you should consider questions like these: Could a person's
level of ability decrease over time? What could happen if it did? Is the
test changing from year to year, to include new knowledge and skills?
What could happen if a person has mastered the old knowledge and
skills but not the new?



Should You Establish an "Uncertain" Category?

When you use a test with a single passing score, two kinds of decision
errors are possible. An unqualified person may get a score above the
passing score; a qualified person may get a score below the passing
score. One way to reduce the chances of both kinds of errors is to
establish an "uncertain" category. For persons in this middle category, you
will have to get additional information before making the pass/fail
decision. This additional information might be another form of the same
test, or a different test, or some other type of evaluation.

To establish an "uncertain" category you will have to choose two
critical scores instead of only one, since you are dividing the test-takers into
three groups instead of only two. With some methods, this modification
will double the time and effort required. However, with the
contrasting-groups method, you may be able to choose two critical scores with very
little extra work. If you have estimated the relationship between a
test-taker's score and the probability that the test-taker will be judged as
qualified, you can specify the two critical scores in terms of these
probabilities. For example, you might decide to pass any test-taker with
more than a 75 percent probability of being qualified, fail any test-taker
with less than a 25 percent probability, and seek additional information
about the rest.
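
The 75 percent and 25 percent rule in this example can be sketched in a few lines of Python. The score-to-probability table below is invented purely for illustration; in practice these probabilities would come from your own contrasting-groups data. The function name and its default thresholds are likewise assumptions, not part of any established procedure.

```python
# Hypothetical table: test score -> estimated probability that a
# test-taker with that score would be judged as qualified.
prob_by_score = {
    10: 0.05, 11: 0.12, 12: 0.22, 13: 0.38,
    14: 0.55, 15: 0.71, 16: 0.83, 17: 0.92,
}

def classify(score, prob_by_score, pass_above=0.75, fail_below=0.25):
    """Return 'pass', 'fail', or 'uncertain' for one test score."""
    p = prob_by_score[score]
    if p > pass_above:
        return "pass"
    if p < fail_below:
        return "fail"
    # Everyone in between needs additional information.
    return "uncertain"

for score in sorted(prob_by_score):
    print(score, classify(score, prob_by_score))
```

With this invented table, scores of 12 and below fail, scores of 16 and above pass, and scores of 13 through 15 fall into the "uncertain" category. The two critical scores follow directly from the two probabilities you choose.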

There may be situations in which you cannot get any additional
information about the test-takers. If no other information is available and the
test-taker cannot even retake the test before a decision must be made,
an "uncertain" category may not be of much help.

Should You Use Normative Information in 
Setting an Absolute Standard? 

This is, to some extent, a philosophical issue. Even an absolute
standard is ultimately normative. That is, people's judgments of what a
person should be able to do will always depend to some extent on what
people can do. However, there is often a gap between what people (for
example, students) can do and what other people (for example,
instructors) think they should be able to do. We believe that if you are using a
method based on judgments about test questions, it makes sense to use
normative data as a "reality check." In this case, we suggest that you not
share the normative information with the judges until after they have
made their initial judgments. Then you can let them know how a group
of real test-takers performed on the test. If the judges' idea of a
"borderline" test-taker is someone whose performance approaches or exceeds
that of the average actual test-taker, their standards may be
unrealistically high. Even if you are using a method based on judgments about
individual test-takers, it may make sense to use normative information
about test-takers who were not judged, as a check on the process of
selecting the test-takers and collecting the judgments. If you know that
most of the test-takers are qualified, and yet the majority of them have
test scores like those of the persons who were judged as "unqualified,"
you have reason to suspect that something went wrong.

Should You Allow the Standard to
Change Over Time?

In many types of testing, continuity of the standard is important. For
example, if the test is a requirement for a diploma or a certificate, the
meaning of the certificate will change if the standard changes. But if the
test is changed from year to year, it may be easier in some years and
harder in others. One way to maintain a constant standard is to adjust
the passing score to account for the differences in the difficulty of the
test. However, such an adjustment may appear to be a change in the
standard, even though its purpose is to avoid a change in the standard.
Therefore, the adjustment may cause political problems. Fortunately,
you can also maintain the standard by adjusting the test scores to
compensate for the change in the difficulty of the test and leaving the
passing score unchanged. This type of adjustment is called "equating." It is
an accepted and widely used technique in educational testing, but it
requires certain types of information linking the two forms of the test. For
example, the two forms of the test may be designed to have several
questions in common, or both forms of the test may be given
experimentally to a group of test-takers.*
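
As a rough illustration of what adjusting the test scores can mean, the sketch below applies linear equating, one simple equating method, to the case in which both forms have been given to the same group of test-takers. It is not presented as the procedure of any particular testing program. The scores are invented and the function name is hypothetical. A score on the new form is converted to the old-form score that lies the same number of standard deviations from the group's mean.

```python
from statistics import mean, pstdev

def linear_equate(new_score, new_form_scores, old_form_scores):
    """Convert a new-form score to the old-form scale by linear equating."""
    slope = pstdev(old_form_scores) / pstdev(new_form_scores)
    return mean(old_form_scores) + slope * (new_score - mean(new_form_scores))

# Invented scores for one group that took both forms.
# The new form is easier: each person scores 2 points higher on it.
old_form_scores = [10, 12, 14, 16, 18]
new_form_scores = [12, 14, 16, 18, 20]

# A raw score of 17 on the easier new form is placed on the old scale,
# where it can be compared with an unchanged passing score.
print(linear_equate(17, new_form_scores, old_form_scores))
```

In this invented case a 17 on the new form corresponds to a 15 on the old form, so a passing score of 15 on the old scale keeps the same meaning. Real equating designs, such as those based on questions common to both forms, are more involved.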

In other types of testing, it may be desirable to have some flexibility in
the standard. Conditions may change over time. Technological
advances may change the levels of certain skills required in an occupation.
A critical shortage of people in an occupation may make it necessary to
lower the standard. Changes in the educational needs of the children in
a school district may require a revision in the standard. Even in the
absence of such changes, experience with the effects of using a
particular standard may indicate that a revision would be desirable. Here
again, equating is necessary if the test is changed from year to year, to
adjust for the differences in difficulty that may result. Without equating,
if this year's test is easier than last year's test, you may think you are
raising the standard when you are actually lowering it.



*For more information on equating, see the chapter by W. H. Angoff cited in the bibliography.




Should You Set Different Passing Scores for
Different Groups of Test-Takers?

In some decision-making situations, the test-takers may come from
different instructional backgrounds. For example, some of the people
taking a test for certification in a profession may have completed a formal
training course, while others may have acquired their professional
knowledge and skills informally, on the job. The test-takers without
formal training may tend to do poorly on the test, but much better in a
practical work situation like the one in which they have gained their
experience. However, the use of a lower passing score for these
test-takers may appear to be a purely political concession, even if it is not
intended to be. The best solution is to use a test that measures only the
knowledge and skills the person actually uses on the job (or comes as
close as possible to this ideal). Also, make sure the test is easy to read
and free of tricky questions (for example, questions containing wrong
answers that are nearly correct). If there are pictures or diagrams on the
test, make sure they look like the things they are supposed to represent.
If the test has already been made up, you may find you have to delete
some questions in order to make it fair.






Helpful
Hints



When you choose the passing score on a test you are making explicit
the lowest level of performance that will be considered acceptable.
Some people may think that the level you have set is absurdly low.
Others, particularly those who fall below it, will think that the level is
unfairly high. It will be difficult to convince either group that your passing
score is appropriate, because there is no purely objective way to set
standards. All methods of setting standards depend on some type of
subjective judgment at some stage of the process. Critics will be able to
argue that those judgments were wrong. You will never be able to prove
that your passing score is correct, but there are steps that you can take
to increase the probability that your passing score will be accepted.

Be Prepared to Explain Why You Are
Using a Passing Score

Even though a passing score may lead to fairer decisions than those
made on a case-by-case basis, some people will perceive the use of a
passing score as arbitrary and unfair. You should be prepared to explain
the reasons for the use of a passing score in your particular testing
situation. In particular, you should be prepared to answer the following
questions:

How are the decisions to be made on the basis of the passing score
being made now?

Why will the use of the passing score be preferable to the current
system?

You should try to anticipate any harm that might be caused by the use
of a passing score. You should also be ready to point out the harm that
would be caused by not using a passing score; that is, by making the
decisions the way they would be made otherwise.

Evaluate the Test 

The test should be adequately reliable and valid for its intended
purpose. It should be free of bias; groups of test-takers should not differ
systematically in their scores unless they truly differ in the knowledge and
skills the test is intended to measure. If the techniques of evaluating test
score reliability, validity, and lack of bias are not among your
competencies, get help from people who do understand these techniques as they
apply to tests used with passing scores. An explanation of these
techniques is beyond the scope of this manual, but it is obviously impossible
to set acceptable standards on unacceptable tests.

In addition to any empirical evidence of test quality, you should
obtain judgments about the test from people who represent those who will
be affected by the test. Their opinions about the appropriateness of the
test will be important in determining their acceptance of the passing
score. A dozen favorable references in the Mental Measurements
Yearbook will not persuade people that your test is acceptable if the test
simply does not look right.

Make Sure the Judges Understand What the Test
is Supposed to Measure

Some of the methods we have presented (Nedelsky's, Angoff's, Ebel's)
require the judges to review the test in detail. Others do not. We
recommend that, no matter what method you are using, you have the judges
look at the test, unless you have a reason not to (for example, test
security). We also suggest that you give the judges a concise description of
the knowledge and skills the test is intended to measure. If you cannot
allow the judges to look at the actual test, we suggest you give them a
detailed description of the knowledge and skills the test is intended to
measure and a few sample questions similar to those on the test. This
kind of preparation will help to guard against the kind of
misunderstanding that can lead to judgments that are not based on the abilities
measured by the test.

Make the Process of Setting the Passing Score
as Open as Possible

The fact that a passing score will be set and the way in which it will be set
should be well publicized. People should have a chance to provide
suggestions and comments early enough in the process to allow you to act
on the information that you receive. For example, if you are setting the
passing score for a test to be used as a requirement for a high school
diploma, parents, students, teachers, school board members, school
administrators, members of community groups, and local employers
should all be encouraged to participate. In many situations it will be
important to involve members of racial, ethnic, and cultural minorities.

Though it may be impossible to have face-to-face meetings with all
the people who may be interested, you can encourage them to write to
you about their concerns. Make it hard for people to say that they did
not have a chance to become involved and state their views. The more
public involvement there is throughout the standard-setting process, the
more likely that the passing score will be accepted when the process is
completed.

Make Sure People Understand How 
the Test Will be Used 

It is important that people understand why the test is being given. They
should know what kinds of decisions will be made on the basis of the
test scores and what kinds will not. If a possible use of a test threatens
people, and if you do not intend to use the test in that way, say so. For
example, if a certification test will be used only for the certification of
new applicants, be sure to tell currently certified people that this
requirement will not be applied to them. In a school setting, if a test is to be
used only to identify students needing remedial work, state explicitly
that the scores will not be used to evaluate teacher performance. Try to
anticipate people's concerns about threatening but unintended uses of
the test and make public guarantees that the test will not be used in
those ways.



Give Adequate Notice That a Passing Score 
Will Be Applied 

It is unfair to make people comply with new requirements unless they
are given enough time to prepare. In some instances, it may even be
illegal, because due process of law requires adequate prior notice of a
new rule that may deny benefits to a person. Whether or not prior
notice is required (and the kind of notice required) will vary with the
situation. For example, people may have entered an accredited training
program under the condition that they would receive their certification
after completing the program and acquiring a certain amount of
experience. The imposition of a new barrier, in the form of a test that they
must pass, could lead to legal challenge. People who enter the program
knowing that they will have to pass a test are far less likely to challenge
that requirement. It may be wise to consult a lawyer before you institute
a passing score that may be used to deny benefits to people who would
otherwise be eligible for them.

Develop Informative Score Reports

We believe that a person who has taken a test that is used with a passing
score should receive at least two types of information: his or her own
score and the passing score. You should also consider providing
additional diagnostic information that might be useful to the test-taker. Even
though the pass/fail decision may be based on a single total score, the
test-taker may find it useful to know how many questions he or she
answered correctly, and how many incorrectly, in each of the main
content categories. Few things are more frustrating than receiving a failing
grade on a test without being given any other information. The more
information you provide, the more likely it is that your testing program
will be accepted.

Allow Plenty of Time for Choosing the Passing Score 

One good way to plan your schedule is to count backwards from the
time the process must be completed. Be sure to allow time for all the
necessary activities. You will have to select and train the judges. If you
are using a method you have not used before, you should allow time for
a small-scale practice run to make sure the procedure will work
properly. You may have to allow time for printing the test, administering the
test, and scoring the test.

Even if the test is available and you are setting the passing score by a
method that does not require test administration, the process may take
longer than you anticipate. For example, it may be difficult to find a time
when all of the judges are free. You may discover, after collecting the
judgments, that some of the judges simply did not understand what
they were doing. In this case, you will have to repeat the judging
process. In setting a passing score, as in other areas of life, it is usually wise
to assume that if something can go wrong, it will.

Review the Process

Before you actually begin to use the passing score to make decisions,
review the procedure by which the passing score was chosen. If
something in the process was not right, you will want to find out about it
before you have begun to apply the passing score. The following
questions may help to focus your attention on things that might have gone
wrong:

Were the judges all qualified to make the kinds of judgments they
were making?

Were the judges a representative group?

Did the judges understand their task?

Did the judges have enough time to complete their task carefully?

Were all the necessary calculations done correctly?

In addition, if the method was based on judgments about test-takers,
consider the following questions:

Did the judges know enough about the test-takers to make valid
judgments?

Did the judges concentrate on the same knowledge and skills that
the test is intended to measure?



Observe the Effects of Using the Passing Score

Once you have begun to use the passing score to make decisions, try to
get information that will enable you to judge its appropriateness. Make
an effort to get opinions from the different types of people who are
affected. In the schools, these would include administrators, teachers,
students, and parents. In an occupational setting, these would include
the test-takers, their colleagues, and their supervisors. Try to find out
what happened to people who failed the test. Is there evidence that
many of them were actually qualified at the time they took the test? Is
there evidence that many of the people who passed the test were
unqualified? What were the consequences of failing a qualified person? Of
passing an unqualified person?

The information you get may well be inconclusive. However, it may
indicate that the passing score was clearly too high or too low. In that
case, you should be prepared to revise it.






Conclusion



All methods of standard setting require judgment. The process of setting
a standard can be only as good as the judgments that go into it. The
standard will depend on whose judgments are involved in the process.
In this sense, all standards are subjective. Yet once a standard has been
set, the decisions based on it can be made objectively. Instead of a
separate set of judgments for each test-taker, you will have the same set of
judgments applied to all test-takers. Standards cannot be objectively
determined, but they can be objectively applied.






Bibliography 



Note: This bibliography is limited to works that have been published as
of July 1981 and deal with the problem of setting standards.

Andrew, B. J. and Hecht, J. T. "A Preliminary Investigation of Two
Procedures for Setting Examination Standards." Educational and
Psychological Measurement, 1976, v. 36, no. 1, pp. 45-50. (Report
of a small-scale experiment comparing Nedelsky's method and Ebel's
method.)

Angoff, W. H. "Scales, Norms, and Equivalent Scores." In R. L.
Thorndike (ed.), Educational Measurement. Washington, D.C., American
Council on Education, 1971, pp. 514-515. (Source document for
Angoff's method.)

Berk, R. A. "Determination of Optimal Cutting Scores in
Criterion-Referenced Measurement." Journal of Experimental Education, 1976,
v. 15, no. 4, pp. 4-9. (A method based on the comparison between
instructed and uninstructed students.)

Brennan, R. L. and Lockwood, R. E. "A Comparison of the Nedelsky
and Angoff Cutting Score Procedures Using Generalizability
Theory." Applied Psychological Measurement, 1980, v. 4, no. 2,
pp. 219-240.

Bunda, M. A. and Sanders, J. R., eds. Practices and Problems in
Competency-Based Measurement. Washington, D.C., National Council
on Measurement in Education, 1979, Chapter IV, "Standards," pp.
47-8"&.-Contains articles by R. M. Jaeger, L. A. Shepard, and L. E, 
Conaway. 

Chuang, D. T., Chen, J. J., and Novick, M. R. "Theory and Practice for
the Use of Cut-Scores for Personnel Decisions." Journal of
Educational Statistics, 1981, v. 6, no. 2, pp. 129-152. (Mathematical
formulas, derivations, and proofs.)

Ebel, R. L. Essentials of Educational Measurement. Englewood Cliffs,
N.J., Prentice-Hall, 1972, pp. 492-494. (Source document for
Ebel's method.)

Hambleton, R. K. "Test Score Validity and Standard Setting Methods."
In R. A. Berk (ed.), Criterion-Referenced Measurement: The State
of the Art. Baltimore, Johns Hopkins University Press, 1980, pp.
80-123.





Huynh, H. "Statistical Consideration of Mastery Scores."
Psychometrika, 1976, v. 41, no. 1, pp. 65-78. (Mathematical theory
for the contrasting-groups method.)

Jaeger, R. M. "An Iterative Structured Judgment Process for
Establishing Standards on Competency Tests: Theory and Application."
Educational Evaluation and Policy Analysis, in press.

Journal of Educational Measurement, 1978, v. 15, no. 4. Special issue
on standard-setting. Contains articles by G. V. Glass, N. W. Burton,
M. Scriven, R. K. Hambleton, J. H. Block, W. J. Popham, R. L.
Linn, and H. M. Levin.

Koffler, S. L. "A Comparison of Approaches for Setting Proficiency
Standards." Journal of Educational Measurement, 1980, v. 17, no.
3, pp. 167-178. (Report of a large-scale experiment comparing
Nedelsky's method and the contrasting-groups method.)

Livingston, S. A. "Choosing Minimum Passing Scores by Stochastic
Approximation Techniques." Educational and Psychological
Measurement, 1980, v. 40, no. 4, pp. 859-873. (Includes a detailed
presentation of the up-and-down method.)

Livingston, S. A. "Comments on Criterion-Referenced Testing."
Applied Psychological Measurement, 1980, v. 4, no. 4, pp. 575-581.

Meskauskas, J. A. and Norcini, J. J. "Standard-Setting in Written and
Interactive (Oral) Specialty Certification Examinations." Evaluation
and the Health Professions, 1980, v. 3, no. 3, pp. 321-360.

Nedelsky, L. "Absolute Grading Standards for Objective Tests."
Educational and Psychological Measurement, 1954, v. 14, no. 1, pp.
3-19. (Source document for Nedelsky's method.)

Popham, W. J. Modern Educational Measurement. Englewood Cliffs,
N.J., Prentice-Hall, 1981. (See Chapter 16, "Setting Performance
Standards," pp. 371-399.)

Schoon, C. G., Gullion, C. M., and Ferrara, P. "Bayesian Statistics,
Credentialing Examinations, and the Determination of Passing
Points." Evaluation and the Health Professions, 1979, v. 2, no. 2,
pp. 181-201.

Shepard, L. "Standard Setting Issues and Methods." Applied
Psychological Measurement, 1980, v. 4, no. 4, pp. 447-467.

Skakun, E. N. and Kling, S. "Comparability of Methods for Setting
Standards." Journal of Educational Measurement, 1980, v. 17, no.
3, pp. 229-235. (Report of a small-scale experiment comparing
Nedelsky's method and Ebel's method.)



Appendix 

Additional Calculations Required by the
Correction for Guessing

The usual correction-for-guessing formula used with multiple-choice
tests depends on the number of choices per question. If each question
has five answer choices, four of them will be wrong answers. The
traditional correction for guessing is to subtract one-fourth of the number of
wrong answers the test-taker chooses. Similarly, if each question has
four answer choices, three of them will be wrong answers; to correct for
guessing, subtract one-third of the number of wrong answers the
test-taker chooses.

The Nedelsky, Angoff, and Ebel methods produce an estimate of the
expected number of correct answers the "borderline" test-taker will
choose. To find the test-taker's expected score, corrected for guessing,
do the following calculations:

1. Subtract the expected number of correct answers from the total
number of questions to get the expected number of wrong answers.

2. Divide this number by the number of wrong answers per question, to
get the expected number of penalty points.

3. Subtract this number from the expected number of right answers, to
get the test-taker's expected score.

For example, suppose the test has ten questions and each question
has five answer choices, as in Table 1 on page 22. If the expected
number of correct answers is 4.43, you would do the following calculations:

Expected number of wrong answers: 10 - 4.43 = 5.57
Expected number of penalty points for guessing: 5.57 ÷ 4 = 1.39
Expected score, corrected for guessing: 4.43 - 1.39 = 3.04
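
The three steps can also be written as a short Python function. This is only a minimal sketch: the function and parameter names are hypothetical, and it assumes, as the steps do, that every question not answered correctly counts as a wrong answer. Because it does not round the penalty points before subtracting, its result for the example is approximately 3.0375, which rounds to 3.04.

```python
def corrected_expected_score(expected_correct, n_questions, choices_per_question):
    """Expected score of the borderline test-taker, corrected for guessing."""
    wrong_choices = choices_per_question - 1          # e.g. 4 when there are 5 choices
    expected_wrong = n_questions - expected_correct   # step 1
    penalty_points = expected_wrong / wrong_choices   # step 2
    return expected_correct - penalty_points          # step 3

# The worked example: ten questions, five answer choices each,
# 4.43 expected correct answers.
score = corrected_expected_score(4.43, n_questions=10, choices_per_question=5)
print(f"{score:.4f}")
```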






